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Abstract 

Sparse estimation methods are aimed at using or obtaining parsimo- 
nious representations of data or models. They were first dedicated to 
linear variable selection but numerous extensions have now emerged 
such as structured sparsity or kernel selection. It turns out that many 
of the related estimation problems can be cast as convex optimiza- 
tion problems by regularizing the empirical risk with appropriate non- 
smooth norms. The goal of this paper is to present from a general per- 
spective optimization tools and techniques dedicated to such sparsity- 
inducing penalties. We cover proximal methods, block-coordinate de- 
scent, reweighted ^-penalized techniques, working-set and homotopy 
methods, as well as non-convex formulations and extensions, and pro- 
vide an extensive set of experiments to compare various algorithms 
from a computational point of view. 
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Introduction 



The principle of parsimony is central to many areas of science: the sim- 
plest explanation of a given phenomenon should be preferred over more 
complicated ones. In the context of machine learning, it takes the form 
of variable or feature selection, and it is commonly used in two situa- 
tions. First, to make the model or the prediction more interpretable or 
computationally cheaper to use, i.e., even if the underlying problem is 
not sparse, one looks for the best sparse approximation. Second, spar- 
sity can also be used given prior knowledge that the model should be 
sparse. 

For variable selection in linear models, parsimony may be directly 
achieved by penalization of the empirical risk or the log-likelihood by 
the cardinality of the support^ of the weight vector. However, this leads 
to hard combinatorial problems (see, e.g., [57 1 1137] ). A traditional con- 
vex approximation of the problem is to replace the cardinality of the 
support by the ^i-norm. Estimators may then be obtained as solutions 
of convex programs. 

Casting sparse estimation as convex optimization problems has 



We call the set of non-zeros entries of a vector the support. 
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two main benefits: First, it leads to efficient estimation algorithms — 
and this paper focuses primarily on these. Second, it allows a fruit- 
ful theoretical analysis answering fundamental questions related to es- 
timation consistency, prediction efficiency |19l UOOj or model consis- 
tency |145|, I158J . In particular, when the sparse model is assumed to 
be well-specified, regularization by the ^i-norm is adapted to high- 
dimensional problems, where the number of variables to learn from 
may be exponential in the number of observations. 

Reducing parsimony to finding the model of lowest cardinality 
turns out to be limiting, and structured parsimony [13 |65j E31 157] 
has emerged as a natural extension, with applications to computer 
vision (32] E21 EI] , text processing [BH] , bioinformatics [HS1 EI] or au- 
dio processing [81]. Structured sparsity may be achieved by penalizing 
other functions than the cardinality of the support or regularizing by 
other norms than the £i-norm. In this paper, we focus primarily on 
norms which can be written as linear combinations of norms on subsets 
of variables, but we also consider traditional extensions such as mul- 
tiple kernel learning and spectral norms on matrices (see Sections 11.31 
and ll.5p . One main objective of this paper is to present methods which 
are adapted to most sparsity-inducing norms with loss functions po- 
tentially beyond least-squares. 

Finally, similar tools are used in other communities such as signal 
processing. While the objectives and the problem set-ups are different, 
the resulting convex optimization problems are often similar, and most 
of the techniques reviewed in this paper also apply to sparse estimation 
problems in signal processing. Moreover, we consider in Section El non- 
convex formulations and extensions. 

This paper aims at providing a general overview of the main opti- 
mization techniques that have emerged as most relevant and efficient 
for methods of variable selection based on sparsity-inducing norms. We 
survey and compare several algorithmic approaches as they apply to 
the £i-norm, group norms, but also to norms inducing structured spar- 
sity and to general multiple kernel learning problems. We complement 
these by a presentation of some greedy and non-convex methods. Our 
presentation is essentially based on existing literature, but the process 
of constructing a general framework leads naturally to new results, 
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connections and points of view. 

This monograph is organized as follows: 

Sections 11.11 and 11.21 introduce respectively the notations used 
throughout the manuscript and the optimization problem (jl.ip which 
is central to the learning framework that we will consider. 

Section 11.31 gives an overview of common sparsity and structured 
sparsity-inducing norms, with some of their properties and examples 
of structures which they can encode. 

Section 11.41 provides an essentially self-contained presentation of 
concepts and tools from convex analysis that will be needed in the rest 
of the manuscript, and which are relevant to understand algorithms for 
solving the main optimization problem . Specifically since sparsity 
inducing norms are non-differentiable convex functional, we introduce 
relevant elements of subgradient theory and Fenchel duality — which 
are particularly well suited to formulate the optimality conditions as- 
sociated to learning problems regularized with these norms. We also 
introduce a general quadratic variational formulation for a certain class 
of norms in Section [T321 the part on subquadratic norms is essentially 
relevant in view of sections on structured multiple kernel learning and 
can safely be skipped in a first reading. 

Section 11.51 introduces multiple kernel learning (MKL) and shows 
that it can be interpreted as an extension of plain sparsity to repro- 
ducing kernel Hilbert spaces (RKHS), but formulated in the dual. This 
connection is further exploited in Section 11.5.21 where its is shown 
how structured counterparts of MKL can be associated with struc- 
tured sparsity-inducing norms. These sections rely on Section 11.4.21 
All sections on MKL can be skipped in a first reading. 

In Section [2J we discuss classical approaches to solving the opti- 
mization problem arising from simple sparsity-inducing norms, such 
as interior point methods and subgradient descent, and point at their 
shortcomings in the context of machine learning. 



2 Throughout this paper, we refer to sparsity-inducing norms such as the £i-norm as non- 
smooth norms; note that all norms are non-diffcrcntiable at zero, but some norms have 
more non-diffcrcntiability points (see more details in Section I l,3|l . 
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Section [3] is devoted to a simple presentation of proximal methods. 
After two short sections introducing the main concepts and algorithms, 
the longer Section 13.31 focusses on the proximal operator and presents 
algorithms to compute it for a variety of norms. Section 13.41 shows 
how proximal methods for structured norms extend naturally to the 
RKHS/MKL setting. 

Section 0] presents block coordinate descent algorithms, which pro- 
vide an efficient alternative to proximal method for separable norms 
like the l\- and l\i '/2-norms, and can be applied to MKL. This section 
uses the concept of proximal operator introduced in Section [3l 

Section [5] presents reweighted-^2 algorithms that are based on the 
quadratic variational formulations introduced in Section 11.4.21 These 
algorithms are particularly relevant for the least-squares loss, for which 
they take the form of iterative reweighted least-squares algorithms 
(IRLS). Section 15.21 presents a generally applicable quadratic varia- 
tional formulation for general norms that extends the variational for- 
mulation of Section 11.4.21 

Section covers algorithmic schemes that take advantage computa- 
tionally of the sparsity of the solution by extending the support of the 
solution gradually. These schemes are particularly relevant to construct 
approximate or exact regularization paths of solutions for a range of val- 
ues of the regularization parameter. Specifically, Section 16.11 presents 
working-set techniques, which are meta-algorithms that can be used 
with the optimization schemes presented in all the previous chapters. 
Section 16.21 focuses on the homotopy algorithm, which can efficiently 
construct the entire regularization path of the Lasso. 

Section [7] presents nonconvex as well as Bayesian approaches that 
provide alternatives to, or extensions of the convex methods that were 
presented in the previous sections. More precisely, Section I7TT1 presents 
so-called greedy algorithms, that aim at solving the cardinality con- 
strained problem and include matching pursuit, orthogonal matching 
pursuit and forward selection; Section 17.21 presents continuous opti- 
mization problems, in which the penalty is chosen to be closer to the 
so-called £o _ P ena lty (i.e., a penalization of the cardinality of the model 
regardless of the amplitude of the coefficients) at the expense of losing 
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convexity, and corresponding optimization schemes. Section 17.31 dis- 
cusses the application of sparse norms regularization to the problem 
of matrix factorization, which is intrinsically nonconvex, but for which 
the algorithms presented in the rest of this monograph are relevant. Fi- 
nally, we discuss briefly in Section T7.4I Bayesian approaches to sparsity 
and the relations to sparsity-inducing norms. 

Section [H] presents experiments comparing the performance of the 
algorithms presented in Sections [2j [3j [H [5j in terms of speed of con- 
vergence of the algorithms. Precisely, Section ED is devoted to the l\- 
regularization case, and Section 18.21 and 18.31 are respectively covering 
the ^i/£ p -norms with disjoint groups and to more general structured 
cases. 

We discuss briefly methods and cases which were not covered in the 
rest of the monograph in Section [9] and we conclude in Section [TU1 

Some of the material from this paper is taken from an earlier book 
chapter [12] and the dissertations of Rodolphe Jenatton [66] and Julien 
Mairal [86]. 

1.1 Notation 

Vectors are denoted by bold lower case letters and matrices by up- 
per case ones. We define for q > 1 the £ g -norm of a vector x in R n as 
ll^llg := (SILi where Xi denotes the i-th coordinate of x, and 

II^Hoo := maxj = i r .. in \xi\ = lim^oo ||a;|| g . We also define the ^-penalty 
as the number of nonzero elements in a vectorlf] ||a;||o := #{i s.t. Xi ^ 
0} = lim g _ !> o+ (Ya=i \ x i\ q )- We consider the Frobenius norm of a ma- 
trix X in R mxn : \\X\\ F := {J2T=i E"=i X?j) 1/2 , where X tj denotes the 
entry of X at row i and column j. For an integer n > 0, and for any 
subset J C {1, . . . , n}, we denote by xj the vector of size \J\ containing 
the entries of a vector x in W 1 indexed by J, and by Xj the matrix 
in ]R" 1X I ,7 I containing the \J\ columns of a matrix X in ]R mxn indexed 
by J. 



3 Note that it would be more proper to write ||a;||g instead of ||x||o to be consistent with the 
traditional notation 1 1 a? 1 1 ^ . However, for the sake of simplicity, we will keep this notation 
unchanged in the rest of the paper. 



6 Introduction 



1.2 Loss Functions 

We consider in this paper convex optimization problems of the form 



where / : MP — > K is a convex differentiable function and f2 : M p — > K is 
a sparsity-inducing — typically nonsmooth and non-Euclidean — norm. 

In supervised learning, we predict outputs y in V from obser- 
vations x in X; these observations are usually represented by p- 
dimensional vectors with X = M. p . In this supervised setting, / generally 
corresponds to the empirical risk of a loss function 1:^x1-^ M + . More 
precisely, given n pairs of data points {(sc®, y^ l >) £ Wx^; i = 1, . . . , n}, 
we have for linear modelaEI f(w) := ^ ^"=1 Kv^i w T x^). Typical 
examples of differentiable loss functions are the square loss for least 
squares regression, i.e., £(y, y) = \{y — y) 2 with y in R, and the logistic 
loss £(y,y) = log(l + e~ yy ) for logistic regression, with y in {—1,1}. 
Clearly, several loss functions of interest are non-differentiable, such 
as the hinge loss £(y, y) = (1 — yy)+ or the absolute deviation loss 
^(y^y) = 1 2/ — 2/| , for which most of the approaches we present in this 
monograph would not be applicable or require appropriate modifica- 
tions. Given the tutorial character of this monograph, we restrict our- 
selves to smooth functions /, which we consider is a reasonably broad 
setting, and we refer the interested reader to appropriate references in 
Section O We refer the readers to [127J for a more complete description 
of loss functions. 

Penalty or constraint? Given our convex data-fitting term f(w), 
we consider in this paper adding a convex penalty XQ(w). Within such a 
convex optimization framework, this is essentially equivalent to adding 
a constraint of the form O(iw) < fi. More precisely, under weak as- 
sumptions on / and f2 (on top of convexity), from Lagrange multiplier 
theory (see [20], Section 4.3) w is a solution of the constrained problem 
for a certain \x > if and only if it is a solution of the penalized prob- 
lem for a certain A > 0. Thus, the two regularization paths, i.e., the 





4 In Section 11.51 we consider extensions to non-linear predictors through multiple kernel 
learning. 
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set of solutions when A and 11 vary, are equivalent. However, there is 
no direct mapping between corresponding values of A and /x. Moreover, 
in a machine learning context, where the parameters A and \x have to 
be selected, for example through cross-validation, the penalized for- 
mulation tends to be empirically easier to tune, as the performance is 
usually quite robust to small changes in A, while it is not robust to 
small changes in /i. Finally, we could also replace the penalization with 
a norm by a penalization with the squared norm. Indeed, following the 
same reasoning as for the non-squared norm, a penalty of the form 
\£l(w) 2 is "equivalent" to a constraint of the form Q(w) 2 ^ fi, which 
itself is equivalent to Q(w) /x 1 / 2 , and thus to a penalty of the form 
\'tt(w) 2 , for A' 7^ A. Thus, using a squared norm, as is often done in the 
context of multiple kernel learning (see Section 1 1 . 5|) , does not change 
the regularization properties of the formulation. 

1.3 Sparsity-Inducing Norms 

In this section, we present various norms as well as their main sparsity- 
inducing effects. These effects may be illustrated geometrically through 
the singularities of the corresponding unit balls (see Figure [T7i|) . 

Sparsity through the fi-norm. When one knows a priori that the 
solutions w* of problem (jl.ip should have a few non-zero coefficients, 
is often chosen to be the £i-norm, i.e., Q(w) = Y^j=i \ w j\- This leads 
for instance to the Lasso |134| or basis pursuit [37] with the square loss 
and to ^i-regularized logistic regression (see, for instance, [73 1128] ) 
with the logistic loss. Regularizing by the £i-norm is known to induce 
sparsity in the sense that, a number of coefficients of w*, depending on 
the strength of the regularization, will be exactly equal to zero. 

£i/^g-norms. In some situations, the coefficients of w* are naturally 
partitioned in subsets, or groups, of variables. This is typically the case, 
when working with ordinal variable^- It is then natural to select or re- 

5 Ordinal variables are integer- valued variables encoding levels of a certain feature, such as 
levels of severity of a certain symptom in a biomedical application, where the values do not 
correspond to an intrinsic linear scale: in that case it is common to introduce a vector of 
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move simultaneously all the variables forming a group. A regularization 
norm exploiting explicitly this group structure, or l\- group norm, can 
be shown to improve the prediction performance and/or interpretabil- 
ity of the learned models [62l [Ml QUI HZ1 E21 [156] • The arguably 
simplest group norm is the so-called-£i /I2 norm: 

n(w):=Y J d g \\w g \\ 2 , (1.2) 

g&S 

where 9 is a partition of {1, . . . ,p}, (d g ) ge $ are some strictly positive 
weights, and w g denotes the vector in M) 9 ^ recording the coefficients 
of w indexed by g in 9- Without loss of generality we may assume 
all weights {d g ) g ^ to be equal to one (when S is a partition, we can 
rescale the values of w appropriately). As defined in Eq. (|1.2j) . O is 
known as a mixed £i/£2-norm. It behaves like an £i-norm on the vector 
(H^glh^eS m an d therefore, Q induces group sparsity. In other 
words, each ||u> 9 ||2, and equivalently each w g , is encouraged to be set 
to zero. On the other hand, within the groups g in 9, the ^-norm 
does not promote sparsity. Combined with the square loss, it leads to 
the group Lasso formulation [142~j 1156] . Note that when 9 is the set of 
singletons, we retrieve the ^i-norm. More general mixed £i/^ g -norms 
for q > 1 are also used in the literature [157] (using q = 1 leads to a 
weighted £i-norm with no group-sparsity effects): 

r 1/9 

V{w) = Y,W w g\\q := ZX{ ^\ w j\ 9 \ ■ 

963 S GS ^ j&g } 

In practice though, the Ixjli- and £i/£oo -settings remain the most pop- 
ular ones. Note that using -norms may have the undesired effect to 
favor solutions w with many components of equal magnitude (due to 
the extra non-differentiabilities away from zero). Grouped ^i-norms are 
typically used when extra-knowledge is available regarding an appropri- 
ate partition, in particular in the presence of categorical variables with 
orthogonal encoding |117j . for multi-task learning where joint variable 
selection is desired |1U7| . and for multiple kernel learning (see Sec- 
tion [L5|). 



binary variables, each encoding a specific level of the symptom, that encodes collectively 
this single feature. 
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Fig. 1.1: (Left) The set of blue groups to penalize in order to select 
contiguous patterns in a sequence. (Right) In red, an example of such 
a nonzero pattern with its corresponding zero pattern (hatched area). 

Norms for overlapping groups: a direct formulation. In an 

attempt to better encode structural links between variables at play 
(e.g., spatial or hierarchical links related to the physics of the problem 
at hand), recent research has explored the setting where 9 in Eq. (|1.2p 
can contain groups of variables that overlap E7J EH H^lj I157j . 
In this case, if the groups span the entire set of variables, f2 is still a 
norm, and it yields sparsity in the form of specific patterns of variables. 
More precisely, the solutions w* of problem (II. ID can be shown to have 
a set of zero coefficients, or simply zero pattern, that corresponds to 
a union of some groups g in 9 [HZ]- This property makes it possible 
to control the sparsity patterns of w* by appropriately defining the 
groups in 9- Note that here the weights d g should not be taken equal 
to one (see, [67] for more details). This form of structured sparsity has 
notably proven to be useful in various contexts, which we now illustrate 
through concrete examples: 

- One-dimensional sequence: Given p variables organized 
in a sequence, if we want to select only contiguous nonzero 
patterns, we represent in Figure 11.11 the set of groups 9 to 
consider. In this case, we have |9| = 0(p). Imposing the 
contiguity of the nonzero patterns is for instance relevant in 
the context of time series, or for the diagnosis of tumors, 
based on the profiles of arrayCGH |113j . Indeed, because 
of the specific spatial organization of bacterial artificial 
chromosomes along the genome, the set of discriminative 
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Fig. 1.2: Vertical and horizontal groups: (Left) the set of blue and green 
groups to penalize in order to select rectangles. (Right) In red, an exam- 
ple of nonzero pattern recovered in this setting, with its corresponding 
zero pattern (hatched area). 



features is expected to have specific contiguous patterns. 

- Two-dimensional grid: In the same way, assume now that 
the p variables are organized on a two-dimensional grid. If 
we want the possible nonzero patterns !P to be the set of all 
rectangles on this grid, the appropriate groups 9 to consider 
can be shown (see [67J) to be those represented in Figure [L"2l 
In this setting, we have |9| = 0{y/p). Sparsity-inducing regu- 
larizations built upon such group structures have resulted in 
good performances for background subtraction [63 | IHT ] [89] . 
topographic dictionary learning [731 EH], wavelet-based 
denoising |112| . and for face recognition with corruption by 
occlusions [71] . 

- Hierarchical structure: A third interesting example as- 
sumes that the variables have a hierarchical structure. Specif- 
ically, we consider that the p variables correspond to the 
nodes of tree 7 (or a forest of trees). Moreover, we assume 
that we want to select the variables according to a certain 
order: a feature can be selected only if all its ancestors in 7 
are already selected. This hierarchical rule can be shown to 
lead to the family of groups displayed on Figure 11.31 

This resulting penalty was first used in [157] : since then, 
this group structure has led to numerous applications, 
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Fig. 1.3: Left: example of a tree-structured set of groups 9 (dashed 
contours in red), corresponding to a tree Twithp = 6 nodes represented 
by black circles. Right: example of a sparsity pattern induced by the 
tree-structured norm corresponding to S; the groups {2, 4}, {4} and 
{6} are set to zero, so that the corresponding nodes (in gray) that form 
subtrees of T are removed. The remaining nonzero variables {1,3,5} 
form a rooted and connected subtree of T. This sparsity pattern obeys 
the following equivalent rules: (i) if a node is selected, the same goes 
for all its ancestors; (ii) if a node is not selected, then its descendant 
are not selected. 

for instance, wavelet-based denoising [T51 E3 EH H57j . 
hierarchical dictionary learning for both topic modeling and 
image restoration [69, 70J, log- linear models for the selection 
of potential orders of interaction in a probabilistic graphical 
model [121] , bioinformatics, to exploit the tree structure of 
gene networks for multi-task regression [73], and multi-scale 
mining of fMRI data for the prediction of some cognitive 
task |68j . More recently, this hierarchical penalty was proved 
to be efficient for template selection in natural language 
processing [U5] . 

- Extensions: The possible choices for the sets of groups S 
are not limited to the aforementioned examples. More com- 
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plicated topologies can be considered, for instance, three- 
dimensional spaces discretized in cubes or spherical volumes 
discretized in slices; for instance, sec [143J for an application 
to neuroimaging that pursues this idea. Moreover, directed 
acyclic graphs that extends the trees presented in Figure [L3l 
have notably proven to be useful in the context of hierarchi- 
cal variable selection P 1121} 1157] , 

Norms for overlapping groups: a latent variable formulation. 

The family of norms defined in Eq. (|1.2|) is adapted to intersection- 
closed sets of nonzero patterns. However, some applications exhibit 
structures that can be more naturally modelled by union- closed families 
of supports. This idea was developed in [65l 1106] where, given a set of 
groups Sj the following latent group Lasso norm was proposed: 



bunion {w ) := min } d g \\v 9 \\ q s.t. 



Ege 3 v9 = w > 



veRpxlsl z_, [ \/g G S, Vj =0 ii j £ g. 

The idea is to introduce latent parameter vectors v 9 constrained each 
to be supported on the corresponding group g, which should explain w 
linearly and which are themselves regularized by a usual £i/£q-noim. 
^union reduces to the usual i\jl q norm when groups are disjoint and 
provides therefore a different generalization of the latter to the case 
of overlapping groups than the norm considered in the previous para- 
graphs. In fact, it is easy to see that solving Eq. (jl.ip with the norm 
^union is equivalent to solving 



(«9eMi9i) 9 eS n 



n 



mm 

can ^ — 4 x " 

36S geS 



and setting w = J2geS v9 - This last equation shows that using the 
norm r2 un i on can be interpreted as implicitly duplicating the variables 
belonging to several groups and regularizing with a weighted l\jl q norm 
for disjoint groups in the expanded space. It should be noted that a 
careful choice of the weights is much more important in the situation of 
overlapping groups than in the case of disjoint groups, as it influences 
possible sparsity patterns [106J. 
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This latent variable formulation pushes some of the vectors v 9 to 
zero while keeping others with no zero components, hence leading to 
a vector w with a support which is in general the union of the se- 
lected groups. Interestingly, it can be seen as a convex relaxation of 
a non-convex penalty encouraging similar sparsity patterns which was 
introduced by [63]. Moreover, this norm can also be interpreted as a 
particular case of the family of atomic norms, which were recently in- 
troduced by [35J. 

Graph Lasso. One type of a priori knowledge commonly encoun- 
tered takes the form of graph defined on the set of input variables, 
which is such that connected variables are more likely to be simultane- 
ously relevant or irrelevant; this type of prior is common in genomics 
where regulation, co-expression or interaction networks between genes 
(or their expression level) used as predictors are often available. To 
favor the selection of neighbors of a selected variable, it is possible to 
consider the edges of the graph as groups in the previous formulation 
(see [B51HI2]). 

Patterns consisting of a small number of intervals. A quite similar 
situation occurs, when one knows a priori — typically for variables form- 
ing sequences (times series, strings, polymers) — that the support should 
consist of a small number of connected subsequences. In that case, one 
can consider the sets of variables forming connected subsequences (or 
connected subsequences of length at most k) as the overlapping groups. 



Multiple kernel learning. For most of the sparsity-inducing terms 
described in this paper, we may replace real variables and their absolute 
values by pre-defined groups of variables with their Euclidean norms 
(we have already seen such examples with ^i/^-norms), or more gen- 
erally, by members of reproducing kernel Hilbert spaces. As shown in 
Section 11.51 most of the tools that we present in this paper are appli- 
cable to this case as well, through appropriate modifications and bor- 
rowing of tools from kernel methods. These tools have applications in 
particular in multiple kernel learning. Note that this extension requires 
tools from convex analysis presented in Section 11.41 
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Trace norm. In learning problems on matrices, such as matrix com- 
pletion, the rank plays a similar role to the cardinality of the support 
for vectors. Indeed, the rank of a matrix M may be seen as the num- 
ber of non-zero singular values of M. The rank of M however is not a 
continuous function of M, and, following the convex relaxation of the 
^o-pseudo-norm into the ^i-norm, we may relax the rank of M into the 
sum of its singular values, which happens to be a norm, and is often 
referred to as the trace norm or nuclear norm of M, and which we 
denote by ||M||*. As shown in this paper, many of the tools designed 
for the ^i-norm may be extended to the trace norm. Using the trace 
norm as a convex surrogate for rank has many applications in control 
theory matrix completion [TJ 1131 j . multi-task learning [110] . or 
multi-label classification [3], where low-rank priors are adapted. 

Sparsity-inducing properties: a geometrical intuition. Al- 
though we consider in Eq. (|1.1|) a regularized formulation, as already 
described in Section 11.21 we could equivalently focus on a constrained 
problem, that is, 

min f(w) such that Q(w) < u, (1-4) 

for some \i E M + . The set of solutions of Eq. (|1.4fl parameterized by \x 
is the same as that of Eq. (II. lh . as described by some value of 
depending on \x (e.g., see Section 3.2 in [2U]). At optimality, the gradient 
of / evaluated at any solution w of (jl.4p is known to belong to the 
normal cone of 23 = {w G W; 0,(w) < /i} at w [20J. In other words, 
for sufficiently small values of i.e., so that the constraint is active, 
the level set of / for the value /(to) is tangent to 23. 

As a consequence, the geometry of the ball 23 is directly related 
to the properties of the solutions w. If £1 is taken to be the ^2-norm, 
then the resulting ball 23 is the standard, isotropic, "round" ball that 
does not favor any specific direction of the space. On the other hand, 
when Q is the ^i-norm, 23 corresponds to a diamond-shaped pattern in 
two dimensions, and to a pyramid in three dimensions. In particular, 23 
is anisotropic and exhibits some singular points due to the extra non- 
smoothness of f2. Moreover, these singular points are located along the 
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axis of MP, so that if the level set of / happens to be tangent at one 
of those points, sparse solutions are obtained. We display in Figure [T~4l 
the balls S for the £\-, £2-, and two different grouped ^i/^-norms. 

Extensions. The design of sparsity-inducing norms is an active field 
of research and similar tools to the ones we present here can be derived 
for other norms. As shown in Section [3l computing the proximal op- 
erator readily leads to efficient algorithms, and for the extensions we 
present below, these operators can be efficiently computed. 

In order to impose prior knowledge on the support of predictor, the 
norms based on overlapping ^i/^oo-norms can be shown to be convex 
relaxations of submodular functions of the support, and further ties can 
be made between convex optimization and combinatorial optimization 
(see [TO] for more details). Moreover, similar developments may be car- 
ried through for norms which try to enforce that the predictors have 
many equal components and that the resulting clusters have specific 
shapes, e.g., contiguous in a pre-defined order, see some examples in 
Section O and, e.g., [TT ] [33 | IHT ] 1135} 1144] and references therein. 

1.4 Optimization Tools 

The tools used in this paper are relatively basic and should be accessible 
to a broad audience. Most of them can be found in classical books 
on convex optimization [IBJ [20J [25j 1105] . but for self-containedness, 
we present here a few of them related to non-smooth unconstrained 
optimization. In particular, these tools allow the derivation of rigorous 
approximate optimality conditions based on duality gaps (instead of 
relying on weak stopping criteria based on small changes or low-norm 
gradients). 

Subgradients. Given a convex function g : R p — > R and a vector w 
in R p , let us define the subdifferential of g at w as 

dg(w) := {z € R p | g(w)+z J (w'-w) < g(w') for all vectors w' £ M p }. 

The elements of dg(w) are called the subgradients of g at w. Note that 
all convex functions defined on MP have non-empty sub differentials at 
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(c) Sl un ion ball for (f) f1 union ball for 

S = {{1,3}, {2, 3}}. 3 = {{1,3}, {2, 3}, {1,2}}. 

Fig. 1.4: Comparison between different balls of sparsity- inducing norms 
in three dimensions. The singular points appearing on these balls de- 
scribe the sparsity-inducing behavior of the underlying norms £1. 
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every point. This definition admits a clear geometric interpretation: 
any subgradient z in dg{w) defines an affine function w' i— > g(w) + 
z T (w' — w) which is tangent to the graph of the function g (because of 
the convexity of g, it is a lower-bounding tangent). Moreover, there is 
a bijection (one-to-one correspondence) between such "tangent affine 
functions" and the subgradients, as illustrated in Figure 11.51 




(a) Smooth case. (b) Non-smooth case. 



Fig. 1.5: Red curves represent the graph of a smooth (left) and a non- 
smooth (right) function /. Blue affine functions represent subgradients 
of the function / at a point w . 

Subdifferentials are useful for studying nonsmooth optimization 
problems because of the following proposition (whose proof is straight- 
forward from the definition): 



Proposition 1.1 (Subgradients at Optimality). 

For any convex function g : W — > R, a point w in MP is a global 
minimum of g if and only if the condition G dg{w) holds. 



Note that the concept of subdifferential is mainly useful for nonsmooth 
functions. If g is differentiable at w, the set dg{w) is indeed the single- 
ton {X7g(w)}, where Vg(w) is the gradient of g at w, and the condi- 
tion G dg{w) reduces to the classical first-order optimality condition 
Vg(w) = 0. As a simple example, let us consider the following opti- 
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Applying the previous proposition and noting that the subdifferential 
d\ ■ | is {+1} for w > 0, {—1} for w < and [—1, 1] for w = 0, it is 
easy to show that the unique solution admits a closed form called the 
soft-thresholding operator, following a terminology introduced in |43j ; 
it can be written 



or equivalently w* = sign(x)(|x| — A) + , where sign(x) is equal to 1 if 
x > 0, — 1 if x < and if x = 0. This operator is a core component 
of many optimization techniques for sparse estimation, as we shall see 
later. Its counterpart for non-convex optimization problems is the hard- 
thresholding operator. Both of them are presented in Figure [L6l Note 
that similar developments could be carried through using directional 
derivatives instead of subgradients (see, e.g., |20j). 




(1.5) 



w 




w 




X 



-V2X 




X 



x 




V2X 



X 



(a) soft-thresholding operator, 
w* = sign(x)(|x| — A) + , 
vain w i(x — to) 2 + X\w\. 



(b) hard-thresholding operator 

W * = 1 \ X \>V2X X 
min w \{x - w) 2 + Al| w |>o- 



Fig. 1.6: Soft- and hard-thresholding operators. 
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Dual norm and optimality conditions. The next concept we in- 
troduce is the dual norm, which is important to study sparsity-inducing 
regularizations K>7| HUUj . It notably arises in the analysis of estima- 
tion bounds [100 J , and in the design of working-set strategies as will be 
shown in Section 16.11 The dual norm f2* of the norm f2 is defined for 
any vector z in MP by 

£l*(z) := max z T w such that Q(w) < 1. (1.6) 

w&Rp 

Moreover, the dual norm of Q* is O itself, and as a consequence, the 
formula above holds also if the roles of f2 and f2* are exchanged. It is 
easy to show that in the case of an £ q -novm, q G [1; +oo], the dual norm 
is the -^'-norni, with q' in [1; +oo] such that ~+^r = 1. In particular, the 
t\- and £oo-norms are dual to each other, and the ^-norm is self-dual 
(dual to itself). 

The dual norm plays a direct role in computing optimality condi- 
tions of sparse regularized problems. By applying Proposition 11.11 to 
Eq. (jl.ip . we obtain the following proposition: 

Proposition 1.2 (Optimality Conditions for Eq. (II. ip ). Let us 

consider problem (jl.ip where Q is a norm on M p . A vector w in R p 
is optimal if and only if — jVf(w) G dQ(w) with 

f{zeW; n*(z)<l}ifw = 0, 
dU(w) = < 

{{z G W; n*(z) = 1 and z T w = n(w)} otherwise. 

(1.7) 



Computing the sub differential of a norm is a classical course exer- 
cise [20j and its proof will be presented in the next section, in Re- 
mark 11.11 As a consequence, the vector is solution if and only if 
f2*(V/(0)) < A. Note that this shows that for all A larger than 
Q,* ( V/(0)) , w = is a solution of the regularized optimization problem 
(hence this value is the start of the non-trivial regularization path). 

These general optimality conditions can be specialized to the Lasso 
problem [134], also known as basis pursuit [37j : 

min — \\y — Xtwlln + Alliolli, (1-8) 
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where y is in M. n , and X is a design matrix in M nX;p . With Eq. (jl.7p in 
hand, we can now derive necessary and sufficient optimality conditions: 



Proposition 1.3 (Optimality Conditions for the Lasso). 

A vector w \s a solution of the Lasso problem (jl.8p if and only if 



where Xj denotes the j-th column of X , and Wj the j-th entry of w. 

Proof. We apply Proposition 11.21 The condition — jVf(w) G f?||tw||i 
can be rewritten: X J (y — Xw) € raA9||i<;||i, which is equivalent to: 
(i) if w = 0, ||X T (y — Xtw)||oo < nX (using the fact that the loo- 
norm is dual to the ^i-norm); (ii) if w 7^ 0, ||X T (y — JTk^Hoo = nX 
and w T X T (y — Xw) = nA||iw||i. It is then easy to check that these 
conditions are equivalent to Eq. (|1.9p . □ 

As we will see in Section I6.2) it is possible to derive from these condi- 
tions interesting properties of the Lasso, as well as efficient algorithms 
for solving it. We have presented a useful duality tool for norms. More 
generally, there exists a related concept for convex functions, which we 
now introduce. 

1.4.1 Fenchel Conjugate and Duality Gaps 

Let us denote by /* the Fenchel conjugate of / [116] . defined by 



Fenchel conjugates are particularly useful to derive dual problems and 
duality gapqj. Under mild conditions, the conjugate of the conjugate of 
a convex function is itself, leading to the following representation of / 
as a maximum of affine functions: 




(1.9) 



f*(z) := sup [z w - f(w)]. 



f{w) = sup [z T w - f*(z)]. 



z&RP 



6 



For many of our norms, conic duality tools would suffice (see, e.g., 25 ). 
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In the context of this tutorial, it is notably useful to specify the expres- 
sion of the conjugate of a norm. Perhaps surprisingly and misleadingly, 
the conjugate of a norm is not equal to its dual norm, but corresponds 
instead to the indicator function of the unit ball of its dual norm. More 
formally, let us introduce the indicator function tn* such that lq* (z) is 
equal to if 0,*(z) < 1 and +oo otherwise. Then, we have the follow- 
ing well-known results, which appears in several text books (e.g., see 
Example 3.26 in [25]): 

Proposition 1.4 (Fenchel Conjugate of a Norm). Let O be a 

norm on M. p . The following equality holds for any zel p 

r T /vi / n f° if < 1 

SUp [Z W - il(w)\ = LQ*(W) = < 

w£Rp +oo otherwise. 



Proof. On the one hand, assume that the dual norm of z is greater than 
one, that is, f2*(z) > 1. According to the definition of the dual norm 
(see Eq. (|1.6p ). and since the supremum is taken over the compact set 
{w € M p ; £l(w) < 1}, there exists a vector w in this ball such that 
= z T w > 1. For any scalar t > 0, consider v = tw and notice 

that 

z T v - n(v) = t[z T w - n(w)] > t, 

which shows that when $l*(z) > 1, the Fenchel conjugate is unbounded. 
Now, assume that Q*(z) < 1. By applying the generalized Cauchy- 
Schwartz's inequality, we obtain for any w 

z T w - n(w) < tt*(z) Q(w) - fi(te) < 0. 

Equality holds for w = 0, and the conclusion follows. □ 

An important and useful duality result is the so-called Fenchel- Young 
inequality (see |20j). which we will shortly illustrate geometrically: 

Proposition 1.5 (Fenchel- Young Inequality). Let w be a vector 
in W, f be a function on W, and z be a vector in the domain of /* 
(which we assume non-empty). We have then the following inequality 

f(w) + t(z)>w T z, 
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with equality if and only if z is in df(w) 



We can now illustrate geometrically the duality principle between a 
function and its Fenchel conjugate in Figure [T.4.H 

Remark 1.1. With Proposition 11.41 in place, we can formally (and 
easily) prove the relationship in Eq. (jl.7p that make explicit the subd- 
ifferential of a norm. Based on Proposition [T31 we indeed know that the 
conjugate of 0, is tr>*. Applying the Fenchel- Young inequality (Propo- 
sition [L5]), we have 



z G d£l(w) 44> z w = tt(w) + in* (z) 
which leads to the desired conclusion. 

For many objective functions, the Fenchel conjugate admits closed 
forms, and can therefore be computed efficiently [20] . Then, it is pos- 
sible to derive a duality gap for problem (jl.l|) from standard Fenchel 
duality arguments (see [20]), as shown in the following proposition: 

Proposition 1.6 (Duality for Problem (11.11) ). 

If /* and £1* are respectively the Fenchel conjugate of a convex and 
differentiable function / and the dual norm of f2, then we have 

max -f*(z) < min f(w) + Xfl(w). (1.10) 

Moreover, equality holds as soon as the domain of / has non-empty 
interior. 

Proof. This result is a specific instance of Theorem 3.3.5 in |20j . In par- 
ticular, we use the fact that the conjugate of a norm Q is the indicator 
function lq* of the unit ball of the dual norm f2* (see Proposition II. 4p . 

□ 

If w* is a solution of Eq. (jl.ip . and w, z in MP are such that Q*(z) < A, 
this proposition implies that we have 

/(w) + xn( w ) > f(w*) + \n(w*) > -f*(z). (i.ii) 
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w*(z) y 








V : + w J z 




C = -f*(z) 



(a) Fenchel conjugate, tangent hyperplanes and subgradients. 





/f(w) 


W T Z 3 -f*(z3Y\ 


v >^ f z 1 -r(z 1 ) 




\^ w T z 2 -f*(z 2 ) 



(b) The graph of / is the envelope of the tangent hyperplanes T(z). 

Fig. 1.7: For all z in W, we denote by 7{z) the hyperplane with nor- 



mal z and tangent to the graph of the convex function /. Fig. (a) : For 
any contact point between the graph of / and an hyperplane 'P(z), we 
have that f(w) + f*(z) = w T z and z is in df(w) (the Fenchel- Young 



inequality is an equality). Fig. (b) ; the graph of / is the convex envelope 
of the collection of hyperplanes C?(z)) z£ m.p ■ 
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The difference between the left and right term of Eq. (jl.lip is called a 
duality gap. It represents the difference between the value of the primal 
objective function f(w) + XQ(w) and a dual objective function —f*(z), 
where z is a dual variable. The proposition says that the duality gap for 
a pair of optima w* and z* of the primal and dual problem is equal to 0. 
When the optimal duality gap is zero one says that strong duality holds. 
In our situation, the duality gap for the pair of primal/dual problems in 
Eq. (jl.lOp . may be decomposed as the sum of two non-negative terms 
(as the consequence of Fenchel- Young inequality) : 

(f(w) + f*(z) - w T z) + X(n(w) + w T (z/X) + l u *(z/X)) . 

It is equal to zero if and only if the two terms are simultaneously equal 
to zero. 

Duality gaps are important in convex optimization because they 
provide an upper bound on the difference between the current value of 
an objective function and the optimal value, which makes it possible to 
set proper stopping criteria for iterative optimization algorithms. Given 
a current iterate w, computing a duality gap requires choosing a "good" 
value for z (and in particular a feasible one). Given that at optimality, 
z(w*) = V/(iw*) is the unique solution to the dual problem, a natural 
choice of dual variable is z = min (l, fpr^jj^yj) Vf(w), which reduces 
to z(w*) at the optimum and therefore yields a zero duality gap at 
optimality. 

Note that in most formulations that we will consider, the function / 
is of the form f(w) = tp(Xw) with ip : R n — > M. and X a design 
matrix. Indeed, this corresponds to linear prediction on MP, given n 
observations Xi, i = 1, . . . , n, and the predictions Xw = (w Xi)i = i ... n . 
Typically, the Fenchel conjugate of ifi is easy to comput^ while the 
design matrix X makes it harcjl to compute /*. In that case, Eq. (jl.ip 
can be rewritten as 

min ib(u) + A Q,(w) s.t. u = Xw, (1.12) 

7 For the least-squares loss with output vector y £ M. n , we have ip(u) = -^\\y — tt||| and 
= ill/311! + P^D- For tnc logistic loss, we have i>(u) = J2i=i Io s(l + exp(— j/jtii)) 

and i/>*(/3) = E?=i(l + &Vi) lo g(! + PiVi) - PiVi log(-/3 l2/l ) if Vi, -ftj/i G [0, 1] and 

+00 otherwise. 
8 It would require to compute the pseudo-inverse of X. 
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and equivalently as the optimization of the Lagrangian 

min max ip(u) + XQ(w) + Aq t (Xw — u) , 

min max (ibiu) — \ot T u) + A (Cl(w) + cx T Xw) , (1.13) 

which is obtained by introducing the Lagrange multiplier a: for the 
constraint u = Xw. The corresponding Fenchel duaj^l is then 

max -ib*(\oc) such that n*(X T a) < 1, (1.14) 

Q6l" 

which does not require any inversion of X T X (which would be required 
for computing the Fenchel conjugate of /). Thus, given a candidate w, 
we consider a = min (l, Q , ^ X T Vtp(Xw)) ) ^^{Xw) , and can get an up- 
per bound on optimality using primal (|1.12|) and dual (|1.14fl problems. 
Concrete examples of such duality gaps for various sparse regularized 
problems are presented in appendix D of [86], and are implemented in 
the open-source software SPAM^"^!. which we have used in the experi- 
mental section of this paper. 

1.4.2 Quadratic Variational Formulation of Norms 

Several variational formulations are associated with norms, the most 
natural one being the one that results directly from (|1.6|) applied to 
the dual norm: 

VLiw) = maxu) T z s.t. VL*(z) < 1. 

However, another type of variational form is quite useful, especially for 
sparsity-inducing norms; among other purposes, as it is obtained by a 
variational upper-bound (as opposed to a lower-bound in the equation 
above), it leads to a general algorithmic scheme for learning problems 
regularized with this norm, in which the difficulties associated with 
optimizing the loss and that of optimizing the norm are partially de- 
coupled. We present it in Section [5j We introduce this variational form 
first for the t\- and ^1/^2-norms and subsequently generalize it to norms 
that we call subquadratic norms. 



9 Fenchel conjugacy naturally extends to this case; see Theorem 3.3.5 in 121)1 for more details. 

10 http : //www . di . ens . f r/willow/SPAMS/ 
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The case of the l\- and fi/^-norms. The two basic variational 
identities we use are, for a, b > 0, 



2ab 



inf 7] l a 2 + 7] b l 



(1.15) 



where the infimum is attained at r/ = a/b, and, for a £ 



\ 2 a 2 



»|S(RI)P 



(1.16) 



The last identity is a direct consequence of the Cauchy- Schwartz in- 
equality: 



P p P 2 i /o p 



i=l 



i=l 



«=i 



(1.17) 



«=i 



The infima in the previous expressions can be replaced by a minimiza- 
tion if the function q : 1R x M + — > R + with q(x, y) = ^- is extended in 



(0,0) using the convention "0/0=0", since the resulting functional is a 
proper closed convex function. We will use this convention implicitly 
from now on. The minimum is then attained when equality holds in 
the Cauchy-Schwartz inequality, that is for ^Jffi oc aj/y^, which leads 
to r]i = pT- if a ^ and else. 

Introducing the simplex A p = {rj £ \ ^2i=iVi = 1}' we a PPly 
these variational forms to the t\- and ^i/^-norms (with non overlap- 
ping groups) with Htwll^/^ = ^C 9 e3 \\ w g\\2 an d |S| = m, so that we 
obtain directly: 



p 2 
wf 



\w\U = mm - > — - + rij 
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W 



mm 



w 
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Sll2 



Quadratic variational forms for subquadratic norms. The 

variational form of the ^i-norm admits a natural generalization for 



1 This extension is in fact the function q : (x, y) \— > min |t £ R_)_ | 



t x 
x y 



0}. 
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certain norms that we call subquadratic norms. Before we introduce 
them, we review a few useful properties of norms. In this section, we 
will denote \w\ the vector . . . , \w p \). 



Definition 1.1 (Absolute and monotonic norm). We say that: 

• A norm is absolute if for all v G W, Q(v) = 0,(\v\). 

• A norm Q is monotonic if for all v,w G W s.t. \v{\ < 
\wi\, i = 1, . . . ,p, it holds that Q(v) < Q(w). 



These definitions are in fact equivalent (see, e.g., [16J): 



Proposition 1.7. A norm is monotonic if and only if it is absolute. 



Proof. If Q is monotonic, the fact that = implies Q(v) = £l(\v\) 
so that £1 is absolute. 

If is absolute, we first show that £1* is absolute. Indeed, 

fi*(*e) = max wk = max |w| t |k| = f2*(|/d). 

wmp, o.(\w\)<i weRp, n(\w\)<i 

Then if \v\ < \w\, since Q*(k) = Q*(\k\), 

n(v) = max |i?| t |k| < max |iu| T |/d = Q(w). 

K6KP,fl*(|*e|)<l »eeEP,n*(|*e|)<l 
which shows that is monotonic. □ 

We now introduce a family of norms, which have recently been 
studied in [Mj • 

Definition 1.2 (H-norm). Let H be a compact convex subset of 
such that H n (1R^_) P / 0, we say that Qh is an ii-norm if Qh(w) = 

Ep w f 
i=i 

The next proposition shows that is indeed a norm and characterizes 
its dual norm. 
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Proposition 1.8. £Ih is a norm and Q* h (k) 2 = max^gg ^J =1 rji nf. 



Proof. First, since H contains at least one element whose components 
are all strictly positive, f2 is finite on W. Symmetry, nonnegativity and 
homogeneity of Q# are straightforward from the definitions. Definite- 
ness results from the fact that H is bounded. is convex, since it is 
obtained by minimization of t] in a jointly convex formulation. Thus 
Qh is a norm. Finally, 

\vl* h (k) 2 = maxffi T K-i!)/f(w) 2 

2 weMP 2 

= max maxw T ft — -■u> T Diag(«)~ 1 «;. 
The form of the dual norm follows by maximizing w.r.t. w. □ 
We finally introduce the family of norms that we call subquadratic. 



Definition 1.3 (Subquadratic Norm). Let f2 and Q* a pair of ab- 
solute dual norms. Let fl* be the function defined as Si* : n 4 
[J1*(|k| 1 / 2 )] 2 where we use the notation (k) 1 / 2 = I 1 / 2 , . . . , \k p \ 1 / 2 ) t . 
We say that Q, is subquadratic if 0* is convex. 



With this definition, we have: 



Lemma 1.1. If Q is subquadratic, then Q* is a norm, and denoting f2 
the dual norm of the latter, we have: 

1 w 2 

Q(w) = - min ^ — + QM 

2 v eR p + ^ m 

2 

n(w) 2 = min V ^ where H = {n £ R p , I fi(n) < 1). 



Proof. Note that by construction, f2* is homogeneous, symmetric and 
definite (fi*(/e) = =>■ k = 0). If Cl* is convex then fi*(|(t> + «)) < 
+ fi*(u)), which by homogeneity shows that 17* also satisfies 
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the triangle inequality. Together, these properties show that Q* is a 
norm. To prove the first identity we have, applying (j!.15|) . and since Q 
is absolute, 

Q(w) = maxK T |«;| s.t. Q*(k)<1 

V 

= max Vk, 1/2 K| s.t. Vt*(n 1 ' 2 ) 2 <1 
1 p W 2 

= max min - >^ — - + K, T r> s.t. 0,*(k) < 1 
K eR^ r,m p + 2 ~[ Vi 

= min max - >^ — - + n T r) s.t. 0,*(k) < 1, 

which proves the first variational formulation (note that we can switch 
the order of the max and min operations because strong duality holds, 
which is due to the non-emptiness of the unit ball of the dual norm). 
The second one follows similarly by applying (|1.16p instead of (|1.15p . 

p 

n(w) 2 = max (V k) /2 \wA) 2 s.t. Win 1 / 2 ) 2 < 1 
+ i=i 

P 2 



aJ2^- s-t. 2> = l,ft»<l 



max mm 

P 2 
W 



max mm 



^ — s.t. rj T K= 1, fi*(«) < 1. 



□ 



Thus, given a sub quadratic norm, we may define a convex set H, 
namely the intersection of the unit ball of with the positive orthant 

2 

such that £l(w) 2 = min^g^ Yli=i a subquadratic norm is 

an //-norm. We now show that these two properties are in fact equiv- 
alent. 



Proposition 1.9. 0, is subquadratic if and only if it is an iZ-norm. 
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Proof. The previous lemma show that subquadratic norms are H- 
norms. Conversely, let £Ih be an ii-norm. By construction, Qji is 
absolute, and as a result of Prop. OJ Q? H (w) = (^(H 1 / 2 )) 2 = 
max^^H X^i Vi\ w i\> which shows that &* H is a convex function, as a 
maximum of convex functions. □ 

It should be noted that the set H leading to a given //-norm {1h 
is not unique; in particular H is not necessarily the intersection of the 
unit ball of a norm with the positive orthant. Indeed, for the £i-norm, 
we can take H to be the unit simplex. 



Proposition 1.10. Given a convex compact set H, let £Ih be the 
associated //-norm and Qfj as defined in Lemma [1. 11 Define the mirror 
image of H as the set Mirr(-ff) = {v € M p \ \v\ € H} and denote 
the convex hull of a set S by Conv(S'). Then the unit ball of Qh is 
Conv(Mirr(iJ)). 



Proof. By construction: 

fiff(K) = fijy(l*| 1/2 ) 2 = maxT7 T |K| 

rieH 

= max w T k = max w T k, 

\w\eH ioeConv(Mirr(_ff)) 

since the maximum of a convex function over a convex set is attained 
at its extreme points. But C = Conv(Mirr(//)) is by construction a 
centrally symmetric convex set, which is bounded and closed like H, 
and whose interior contains since H contains at least one point whose 
components are strictly positive. This implies by Theorem 15.2 in [116 
that C is the unit ball of a norm (namely x \— > inf{A G M + | x £ AC}), 
which by duality has to be the unit ball of Cljj- □ 

This proposition combined with the result of Lemma 11.11 therefore 
shows that if Conv(Mirr(i/)) = Conv(Mirr(#')) then H and H' de- 
fine the same norm. 

Several instances of the general variational form we considered in 
this section have appeared in the literature |71} 1110} 1111] . For norms 
that are not subquadratic, it is often the case that their dual norm 
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is itself subquadratic, in which case symmetric variational forms can 
be obtained [2]. Finally, we show in Section [5] that all norms admit 
a quadratic variational form provided the bilinear form considered is 
allowed to be non-diagonal. 

1.5 Multiple Kernel Learning 

A seemingly unrelated problem in machine learning, the problem of 
multiple kernel learning is in fact intimately connected with sparsity- 
inducing norms by duality. It actually corresponds to the most natu- 
ral extension of sparsity to reproducing kernel Hilbert spaces. We will 
show that for a large class of norms and, among them, many sparsity- 
inducing norms, there exists for each of them a corresponding multiple 
kernel learning scheme, and, vice-versa, each multiple kernel learning 
scheme defines a new norm. 

The problem of kernel learning is a priori quite unrelated with par- 
simony. It emerges as a consequence of a convexity property of the 
so-called "kernel trick", which we now describe. Consider a learning 
problem with f(w) = ifj(Xw). As seen before, this corresponds to lin- 
ear predictions of the form Xw = (w 1 JCi)i=i ...,n- Assume that this 
learning problem is this time regularized this time by the square of the 
norm SI (as shown in Section \l.2\ this does not change the regulariza- 
tion properties), so the we have the following optimization problem: 




(1.18) 



As in Eq. (|1.12|) we can introduce the linear constraint 




u = Xw 



(1.19) 



and reformulate the problem as the saddle point problem 



min maxib(u)-\ — £l(w) 2 — \ot T (u — Xw 



(1.20) 



Since the primal problem (j!.19j) is a convex problem with feasible linear 
constraints, it satisfies Slater's qualification conditions and the order 
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of maximization and minimization can be exchanged: 

max min (tb(u) — (w) 2 + a T Xw). (1.21) 

Now, the minimization in u and u> can be performed independently. 
One property of norms is that the Fenchel conjugate of w i— > ^Q(w) 2 
is k 4 ^^*(k) 2 ; this can be easily verified by finding the vector w 
achieving equality in the sequence of inequalities k t w < Q(w) 0*(k) < 
| [il(iu) 2 + ri*(ft) 2 ] . As a consequence, the dual optimization problem 
is 

max -ib*(Xa) - -n*(X T a) 2 . (1.22) 

If is the Euclidean norm (i.e., the ^morm) then the previous problem 
is simply 

G(K) := max -ip*(\a) - -a T Ka with K = XX T . (1.23) 
Focussing on this last few remarks are crucial: 

(1) The dual problem depends on the design X only through the 
kernel matrix K = XX T € R nxn . 

(2) G is a convex function of K (as a maximum of linear func- 
tions). 

(3) The solutions w* and a* to the primal and dual problems 
satisfy w* = X 1 a* = Y17=i a i x i- 

(4) The exact same duality result applies for the generalization 
to w, Xi G "K for "K a Hilbert space. 



The first remark suggests a way to solve learning problems 
that are non-linear in the inputs x^. in particular consider a non- 
linear mapping which maps X{ to a high-dimensional <f)(xi) € 
•K with Oi = R d for d > p or possibly an infinite dimensional 
Hilbert space. Then consider the problem (|1.18|) with now f(w) = 
ip(((w, <^(sBi)))i=i,...,n) i which is typically of the form of an empirical 
risk f(w) = ^ X^ILi i { w i ^{ x i)))- I* becomes high-dimensional to 
solve in the primal, while it is simply solved in the dual by choosing a 
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kernel matrix with entries Kij = ((j)(xi),(f)(xj)), which is advantageous 
as soon as n 2 < d; this is the so-called "kernel trick" (see more details 

in [mjEU). 

In particular if we consider functions h £ IK where "K is a repro- 
ducing kernel Hilbert space (RKHS) with reproducing kernel K then 

M{{h{ Xl )) l=l _ n ) + hh\\l (1.24) 

is solved by solving Eq. (I1.23f) with = K(xi,Xj). When applied to 
the mapping (f> : x \— > K(x, •), the third remark above yields a specific 
version of the representer theorem of Kimmeldorf and Wahba |75|^l 
stating that h*(-) = Y17=i a iK( x i> ')• I n this case, the predictions may 
be written equivalently as h(xi) or (w,(j>(xi)), i = 1, . . . ,n. 

As shown in [79] , the fact that G is a convex function of K suggests 
the possibility of optimizing the objective with respect to the choice of 
the kernel itself by solving a problem of the form minxes G(K) where 
% is a convex set of kernel matrices. 

In particular, given a finite set of kernel functions (-fQ)i<i<p it is 
natural to consider to find the best linear combination of kernels, which 
requires to add a positive definiteness constraint on the kernel, leading 
to a semi-definite program |79j : 

minGGXi^K;) s.t. E?=i»*«i h 0, tr(X^Li m K i) < 1- (1-25) 



Assuming that the kernels have equal trace, the two constraints of the 
previous program are avoided by considering convex combinations of 
kernels, which leads to a quadratically-constrained quadratic program 
(QCQP) 



minGQXi^) s.t. ELi^ = l- (1-26) 



We now present a reformulation of Eq. (|1.26|) using sparsity-inducing 
norms (see [7] [TBI [TTT] for more details). 



12 Note that this provides a proof of the representer theorem for convex losses only and 
that the parameters a arc obtained through a dual maximization problem. 
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1.5.1 Prom £i/^2-Regularization to MKL 

As we presented it above, MKL arises from optimizing the objective 
of a learning problem w.r.t. to a convex combination of kernels, in the 
context of plain £2- or Hilbert norm regularization, which seems a pri- 
ori unrelated to sparsity. We will show in this section that, in fact, the 
primal problem corresponding exactly to MKL (i.e., Eq. 11.260 is an 
l\ /^-regularized problem (with the norm defined in Eq. (I1.2D ). 

in the sense that its dual is the MKL problem for the set of kernels 
associated with each of the groups of variables. The proof to estab- 
lish the relation between the two relies on the variational formulation 
presented in Section 11.4.21 

We indeed have, assuming that S is a partition of {1, . . . ,p}, with 
|S| = m, and A m denoting the simplex in W™, 



mm i>(Xw) + ±(Y, geS \\w 9 h) 



2 



mm 



min *P(E g eS vl^Xg^g) + £ ^ ||»» 



,»7SA m 1 t - 



2 



i/>(Xw) + hwg s.t. X = [r,]( 2 X gi ,. . .^X g 



mm 

?7£A m 2 



A 



= min G(£ eS T) g K g ), 

~ 1/2 

where the third line results from the change of variable w g r\ g = w g , 
and the last step from the definition of G in Eq. f|1.23|) . 

Note that ^i-regularization corresponds to the special case where 
groups are singletons and where HQ = rank-one kernel matrix. 

In other words, MKL with rank-one kernel matrices (i.e., feature spaces 
of dimension one) is equivalent to ^i-regularization (and thus simpler 
algorithms can be brought to bear in this situation). 

We have shown that learning convex combinations of kernels 
through Eq. (I1.26P turns out to be equivalent to an ^i/^-norm penal- 
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ized problems. In other words, learning a linear combination J^^ =1 TJiKi 
of kernel matrices, subject to r] belonging to the simplex A m is equiva- 
lent to penalizing the empirical risk with an ^i-norm applied to norms 
of predictors ||iw ff ||2. This link between the ^i-norm and the simplex 
may be extended to other norms, among others to the subquadratic 
norms introduced in Section 11.4.21 



1.5.2 Structured Multiple Kernel Learning 



In the relation established between ^/^-regularization and MKL in 
the previous section, the vector of weights r] for the different kernels 
corresponded with the vector of optimal variational parameters defining 
the norm. A natural way to extend MKL is, instead of considering a 
convex combination of kernels, to consider a linear combination of the 
same kernels, but with positive weights satisfying a different set of 
constraints than the simplex constraints. Given the relation between 
kernel weights and the variational form of a norm, we will be able to 
show that, for norms that have a variational form as in Lemma ll.81 we 
can generalize the correspondence between the ^i/^2-norm and MKL 
to a correspondence between other structured norms and structured 
MKL schemes. 

Using the same line of proof as in the previous section, and given 
an H-noim. (or equivalently a subquadratic norm) Qh as defined in 
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Definition 11.21 we have: 

(1.27) 



This results shows that the regularization with the norm Qh in the 
primal is equivalent to a multiple kernel learning formulation in which 
the kernel weights are constrained to belong to the convex set H, which 
defines £Ih variationally. Note that we have assumed that H C R2., 
so that formulations such as (ll.25p . where positive semidefiniteness of 
Y^h=i Vi^i nas to be added as a constraint, are not included. 

Given the relationship of MKL to the problem of learning a func- 
tion in a reproducing kernel Hilbert space, the previous result suggests 
a natural extension of structured sparsity to the RKHS settings. In- 
deed let, h = (hi, . . . , h p ) £ S := Mi x . . . x Jf p , where "Ki are RKHSs. 
It is easy to verify that A : h i-> fiff ((||/ii||:Ki> • • • > ll^pl|jf p )) is a con- 
vex function, using the variational formulation of and since it is 
also non-negative definite and homogeneous, it is a nor 

nQ 

Moreover, 

the learning problem obtained by summing the predictions from the 
different RKHSs, i.e., 

A 2 
minip((hi(xi) + . . . + h p (x i ))i= 1) ... jn ) + -tt H ((\\hi\\x 1 , . . . , H^plkp)) 

(1.28) 



As we show in Section 13.41 it actually sufficient to assume that Q is monotonic for A to 
be a norm. 



min 






= min 


i=i n 




= min 




p 

i=l 


= min 


tp(Xw) + — \\w 2 S.t. 


X = [ V 


= min max 


-</>*(Aa)- 


Vi K i) a 


= min 


GiEtxViKi). 
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is equivalent, by the above derivation, to the MKL problem 
mm^H G(^2 P i=l r]iKi) with [Ki]jj> = Ki(xj,Xj>) for K{ the repro- 
ducing kernel of JCj. See Section [3.41 for more details. 

This means that, for most of the structured sparsity-inducing norms 
that we have considered in Section Tl. 3\ we may replace individual vari- 
ables by whole Hilbert spaces. For example, tree-structured sparsity 
(and its extension to directed acyclic graphs) was explored in [9] where 
each node of the graph was a RKHS, with an application to non-linear 
variable selection. 



2 



Generic Methods 



The problem defined in Eq. (|l.ip is convex, as soon as both the loss / 
and the regularizer f2 are convex functions. In this section, we consider 
optimization strategies which are essentially blind to problem struc- 
ture. The first of these techniques is subgradient descent (see, e.g., 
|18j). which is widely applicable, has low running time complexity per 
iterations, but has a slow convergence rate. As opposed to proximal 
methods presented in Section 13.14 it does not use problem structure. 
At the other end of the spectrum, the second strategy is to consider 
reformulations such as linear programs (LP), quadratic programs (QP) 
or more generally, second-order cone programming (SOCP) or semidef- 
inite programming (SDP) problems (see, e.g., [25J). The latter strategy 
is usually only possible with the square loss and makes use of general- 
purpose optimization toolboxes. Moreover, these toolboxes are only 
adapted to small-scale problems and usually lead to solution with very 
high precision (low duality gap), while simpler iterative methods can 
be applied to large-scale problems but only leads to solution with low 
precision, which is sufficient in most applications to machine learn- 
ing (see [22] for a detailed discussion). 
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Subgradient descent. For all convex unconstrained problems, sub- 
gradient descent can be used as soon as one subgradient can be com- 
puted efficiently. In our setting, this is possible when a subgradient of 
the loss /, and a subgradient of the regularizer Q can be computed. This 
is true for all the norms that we have considered. The corresponding 
algorithm consists of the following iterations: 

(X 

Wf+i =w t - jp(s + As ), where s G df(w t ), s G d£l(w t ), 

with a a well-chosen positive parameter and j3 typically 1 or 1/2. Un- 
der certain conditions, these updates are globally convergent. More 
precisely, we have, from [101 ], F(w t ) — mm w£ ^ P F(w) = O(^r) for 
Lipschitz-continuous function and /3 = 1/2. However, the convergence 
is in practice slow (i.e., many iterations are needed), and the solutions 
obtained are usually not sparse. This is to be contrasted with the prox- 
imal methods presented in the next section which are less generic but 
more adapted to sparse problems, with in particular convergence rates 
in 0(l/t) and 0(l/t 2 ). 

Reformulation as LP, QP, SOCP, SDP. For all the sparsity- 
inducing norms we consider in this paper the corresponding regularized 
least-square problem can be represented by standard mathematical pro- 
gramming problems, all of them being SDPs, and often simpler (e.g., 
QP). For example, for the ^i-norm regularized least-square regression, 
we can reformulate min.„,eRp ^\\y — -Xm>||! + AO(iw) as 

min — \\y — Xw + + Xw_\\2 + X(1 T w + + l T w_), 
w + ,w~m p + 2n 

which is a quadratic program. Grouped norms with combinations of 
norms leads to an SOCP, i.e., min wg R P ^\\y — Xw\\\ + d g \\w g \\2 
may be formulated as 

min —\\y - Xw\\\ + \^^d g t g s.t \/g G 9, \\w g \\ 2 <t g . 

™eRp, {t g ) g& smf ln ge5 

Other problems can be similarly cast (for the trace norm, see [8j [49]). 
General-purpose toolboxes can then be used, to get solutions with high 
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precision (low duality gap). However, in the context of machine learn- 
ing, this is inefficient for two reasons: (1) these toolboxes are generic 
and blind to problem structure and tend to be too slow, or cannot even 
run because of memory problems, (2) as outlined in [22], high precision 
is not necessary for machine learning problems, and a duality gap of 
the order of machine precision (which would be a typical result from 
such toolboxes) is not necessary. 

We present in the following sections methods that are adapted to 
problems regularized by sparsity- inducing norms. 



3 



Proximal Methods 



This chapter reviews a class of techniques referred to as proximal meth- 
ods, where, the non-smooth component of the objective (jl.ip will only 
be involved in the computations through an associated proximal oper- 
ator, which we formally define subsequently. 

The presentation that we make of proximal methods in this chapter 
is deliberately simplified, and to be rigorous the methods that we will 
refer to as proximal methods in this section are known as forward- 
backward splitting methods. We refer the interested reader to Section [9] 
for a broader view and references. 

3.1 Principle of Proximal Methods 

Proximal methods (i.e., forward-backward splitting methods) are 
specifically tailored to optimize an objective of the form i.e., 
which can be written as the sum of a generic smooth differentiable 
function / with Lipschitz-continuous gradient, and a non-differentiable 
function \Q. 

They have drawn increasing attention in the machine learning com- 
munity, especially because of their convergence rates and their ability to 
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deal with large nonsmooth convex problems (e.g., [T7 1 l38l [1031 1151] ). 
Proximal methods can be described as follows: at each iteration the 
function / is linearized around the current point and a problem of the 
form 

mm /(tu*)+V/(u>*) T (u> - w l ) + AO(xo) + -\\w - (3.1) 

is solved. The quadratic term, called proximal term, keeps the update 
in a neighborhood of the current iterate w l where / is close to its 
linear approximation; L>0 is a parameter, which should essentially be 
an upper bound on the Lipschitz constant of V/ and is typically set 
with a line-search. This problem can be rewritten as 

min h\w- (tu*--^V/(ti;*))||J + -n(w). (3.2) 

It should be noted that when the nonsmooth term f2 is not present, 
the solution of the previous proximal problem, also known as the back- 
ward or implicit step, just yields the standard gradient update rule 
w t+1 4— w t — j^Vf(w t ). Furthermore, if f2 is the indicator function of a 
set lc, i.e., defined by lc{x) = for x E C and ic{x) = +oo otherwise, 
then solving (|3.2H yields the projected gradient update with projection 
on the set C. This suggests that the solution of the proximal problem 
provides an interesting generalization of gradient updates, and moti- 
vates the introduction of the notion of a proximal operator associated 
with the regularization term AfL 

The proximal operator, which we will denote Prox^^, was defined in 
[95] as the function that maps a vector u G W to the uniqu^] solution 
of 

min —\\u — w\\o + U^i(w). (3.3) 

This operator is clearly central to proximal methods since their main 
step consists in computing ProxA n (tu* — -^V/(to*)). 

In Section 13. 3( we present analytical forms of proximal operators 
associated with simple norms and algorithms to compute them in some 
more elaborate cases. Note that the proximal term in Eq. (|3,ip could 



1 Since the objective is strongly convex. 
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be replaced by any Bregman divergences (see, e.g., |140j ). which may 
be useful in settings where extra constraints (such as non-negativity) 
are added to the problem. 

3.2 Algorithms 

The basic proximal algorithm uses the solution of problem (|3.2|) as 
the next update w t+1 ; however fast variants such as the accelerated 
algorithm presented in |103] or FISTA [T7] maintain two variables and 
use them to combine at marginal extra computational cost the solution 
of (13. 2p with information about previous steps. Often, an upper bound 
on the Lipschitz constant of V/ is not known, and even if it i 

& it 

is often better to obtain a local estimate. A suitable value for L can 
be obtained by iteratively increasing L by a constant factor until the 
condition 

f(wt) < /(«,*) + V/(^) T « - «,*) + " w'Wl (3.4) 

is met, where w* L denotes the solution of ()3.3|) . 

For functions / whose gradients are Lipschitz-continuous, the basic 
proximal algorithm has a global convergence rate in O(j) where t is 
the number of iterations of the algorithm. Accelerated algorithms like 
FISTA can be shown to have global convergence rate — on the objective 
function — in O(-p), which has been proved to be optimal for the class 
of first-order techniques [101 j . 

Note that, unlike for the simple proximal scheme, we cannot guar- 
antee that the sequence of iterates generated by the accelerated version 
is itself convergent |38j . 

Perhaps more importantly, both basic (ISTA) and accelerated |103| 
proximal methods are adaptive in the sense that if / is strongly 
convex — and the problem is therefore better conditioned — the conver- 
gence is actually linear (i.e., with rates in 0(C ) for some constant 
C < 1; see |103| ). Finally, it should be noted that accelerated schemes 
are not necessarily descent algorithms, in the sense that the objective 

2 For problems common in machine learning where f(w) = ip(Xw) and ip is twice differen- 
tiable, then L may be chosen to be the largest eigenvalue of — X T X times the supremum 
over u £ R™ of the largest eigenvalue of the Hessian of ip at u. 
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does not necessarily decrease at each iteration in spite of the global 
convergence properties. 

3.3 Computing the Proximal Operator 

Computing the proximal operator efficiently and exactly allows to at- 
tain the fast convergence rates of proximal methodso We therefore 
focus here on properties of this operator and on its computation for 
several sparsity-inducing norms. For a complete study of the proper- 
ties of the proximal operator, we refer the interested reader to [39] . 

Dual proximal operator. In the case where O is a norm, by Fenchel 
duality, the following problem is dual (see Proposition 1 1 ,6|) to problem 

max — - [||t> — will — ||it|| 2 ] such that Q*(v) < ri. (3.5) 



Lemma 3.1 (Relation to Dual Proximal Operator). Let 

Prox^Q be the proximal operator associated with the regulariza- 
tion fj,Q, where O is a norm, and Proj j^*(.) <At } be the projector 
on the ball of radius ri of the dual norm Q*. Then Proj j^*(.) <At } is 
the proximal operator for the dual problem (|3.5|) and, denoting the 
identity 1^, these two operators satisfy the relation 

Prox^ = Id- Pr °j{n* (•)</*} • ( 3 - 6 ) 

Proof. By Proposition ll.6[ if w* is optimal for (|3.3p and v* is optimal 
for (I3.5p . we haveEl — v* = Vf(w*) = w* — u. Since v* is the projection 
of u on the ball of radius fj, of the norm Q*, the result follows. 



This lemma shows that the proximal operator can always be computed 
as the residual of a Euclidean projection onto a convex set. More general 
results appear in 

3 Note, however, that fast convergence rates can also be achieved while solving approxi- 
mately the proximal problem, as long as the precision of the approximation iteratively 
increases with an appropriate rate (see 11 221 for more details). 

4 The dual variable from Fenchel duality is — v in this case. 
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4-norm regularization. Using optimality conditions for (13. 5h and 
then (13. 6|) or subgradient condition (|1.7p applied to (|3.3I) . it is easy to 
check that Projni.ii <fJ \ and Prox^y.^ respectively satisfy: 

[ Pr °j{||-lloo</x}(«)]j = min (!. u i> 

and 

[ PrQX HI-lli( u )]i = I 1 ~ T7~r) u i = s S n ( u i)(l u il ~ ^)+> 

J \ | XL j | / + 

for j € {l,...,p}, with (x)+ := max(x,0). Note that Prox^j.^ is 
componentwise the soft-thresholding operator of [43J presented in Sec- 
tion CCD 

li-norm constraint. Sometimes, the £i-norm is used as a hard con- 
straint and, in that case, the optimization problem is 

min f(w) such that ||tw||i < C. 

w 

This problem can still be viewed as an instance of (jl.lh . with Q defined 
by Q(u) = if Hull i < C and Vt(u) = +oo otherwise. Proximal methods 
thus apply and the corresponding proximal operator is the projection 
on the ^i-ball, itself an instance of a quadratic continuous knapsack 
problem for which efficient pivot algorithms with linear complexity have 
been proposed [271 185| . Note that when penalizing by the dual norm of 
the ^i-norm, i.e., the ^oo-norm, the proximal operator is also equivalent 
to the projection onto an ^i-ball. 

^-regularization (ridge regression). This regularization function 
does not induce sparsity and is therefore slightly off topic here. It is 
nonetheless widely used and it is worth mentioning its proximal oper- 
ator, which is a scaling operator: 

ProxMii ii2 \u] = u. 

2 ,M| 2 L J 1 + Ll 



h + ^2" re § u l ar i za ti° n (Elastic- net |159|). This regularization 
function combines the ^i-norm and the classical squared ^-penalty. For 
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a vector w in MP, it can be written Cl(w) = ||tw||i + % 1 1 -to 1 1 2 , where 7 > 
is an additional parameter. It is not a norm, but the proximal operator 
can be obtained in closed form: 

ProX M(l|.||i+il|.||^) =Prox ^||.||| oProx HI-lla = ^i Prox HI-llx- 

Similarly to the £i-norm, when f2 is used as a constraint instead of a 
penalty, the proximal operator can be obtained in linear time using 
pivot algorithms (see [57], Appendix B.2). 



ID-total variation. Originally introduced in the image process- 
ing community [118j . the total- variation penalty encourages piece- 
wise constant signals. It can be found in the statistics literature un- 
der the name of "fused lasso" |135j . For one-dimensional signals, 
it can be seen as the £i-norm of finite differences for a vector w 
in W: Q,Ty_ir)(w) := X)f=i \ w i+l ~ w i\- Even though no closed form 
is available for Prox^f2 TV 1D , it can be easily obtained using a mod- 
ification of the homotopy algorithm presented later in this paper in 
Section 16.21 (see [551 ES])- Similarly, it is possible to combine this 
penalty with the l\- and squared ^2-penalties and efficiently com- 
pute Prox H H ,72.11 112 or use such a regularization function in 
a constrained formulation (see |87j . Appendix B.2). 



Anisotropic 2D-total variation. The regularization function 
above can be extended to more than one dimension. For a two 
dimensional-signal W in W° xl this penalty is defined as ^tv-2d(W0 := 
Ya=i Zj=i \ w i+i,j ~ w i,j\ ~ w i,j\ - Interestingly, it has been 

shown in [33] that the corresponding proximal operator can be obtained 
by solving a parametric maximum flow problem. 



fi/^-norm ("group Lasso"). If S is a partition of {1, ... ,p}, the 
dual norm of the t\jt q norm is the /£ q > norm, with - + ^ = 1. It 
is easy to show that the orthogonal projection on a unit loo/£q' ball is 
obtained by projecting separately each subvector u g on a unit iy-ball 
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in W 9 K For the ^1/^2-norm ft ■ w 1— > Ylges ll^slb we have 

[Prox^(u)]p = (l - F A r ) up, 5 G S. (3.7) 

v ||Mp||2' + 

This is shown easily by considering that the subgradient of the -^-norm 
is 9||iw||2 = { | | ^j | 2 } if w 7^ or 9||u;||2 = {z \ \\z\\2 < 1} if w = and 
by applying the result of Eq. (II. 7p . 

For the ^i/^oo-norm, whose dual norm is the ^oo/^i-norm, an effi- 
cient algorithm to compute the proximal operator is based on Eq. (|3.6|) , 
Indeed this equation indicates that the proximal operator can be com- 
puted on each group g as the residual of a projection on an ^i-norm 
ball in M' 5 ' ; the latter is done efficiently with the previously mentioned 
linear-time algorithms. 



£i/^2-norm constraint. When the £i/^2-norm is used as a constraint 
of the form fi(iw) < C, computing the proximal operator amounts to 
perform an orthogonal projection onto the ^1/^2-ball of radius C. It 
is easy to show that such a problem can be recast as an orthogonal 
projection onto the simplex |42j . We know for instance that there ex- 
ists a parameter \i > such that the solution w* of the projection 
is Prox^n[it] whose form is given in Equation (I3.7p . As a consequence, 
there exists scalars z 9 > such that w* = [uf-jn u g (where to simplify 
but without loss of generality we assume all the u g to be non-zero). 
By optimizing over the scalars z 9 , one can equivalently rewrite the 
projection as 

min ^DKIk-* 9 ) 2 B -*- 2>^ C < 

which is a Euclidean projection of the vector [||itp||2]geS m ^ onto a 
simplexll. The optimization problem above is then solved in linear time 
using the previously mentioned pivot algorithms [27, 85]. 

We have shown how to compute the proximal operator of group- 
norms when the groups form a partition. In general, the case where 



5 This result also follows from Lemma l3. 31 applied to the compute the proximal operator of 
the I'oo/fe-norm which is dually related to the projection on the <?i/£2-norm 
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groups overlap is more complicated because the regularization is no 
longer separable. Nonetheless, in some cases it is still possible to com- 
pute efficiently the proximal operator. 

Hierarchical fi/^-norms. Hierarchical norms were proposed in 
[157 1 . Following [69 1, we focus on the case of a norm Q : w t— >■ 
S ff eS II^sIIq' w hh q £ {2,oo}, where the set of groups S is tree- 
structured, meaning that two groups are either disjoint or one is in- 
cluded in the other. Let ■< be a total order such that g\ < g2 if and only 
if either g\ C <?2 or g\ n <?2 = 0@ Then, if g\ X . . . X g m with m = |S|, 
and if we define U g as (a) the proximal operator w g 1— > Piox^.^^Wg) 
on the subspace corresponding to group g and (b) the identity on the 
orthogonal, it can be shown [69J that: 



In other words, the proximal operator associated with the norm can 
be obtained as the composition of the proximal operators associated 
with individual groups provided that the ordering of the groups is well 
chosen. Note that this result does not hold for q ^ {1,2, 00} (see [f)U] 
for more details). In terms of complexity, Prox^Q can be computed in 
0(p) operations when q = 2 and 0(pd) when q = 00, where d is the 
depth of the tree. 

Combined ("sparse group Lasso"), with q € 

{2, 00}. The possibility of combining an £i/£ q -noTm that takes advan- 
tage of sparsity at the group level with an £i-norm that induces sparsity 
within the groups is quite natural [501 H30] . Such regularizations are 
in fact a special case of the hierarchical £i/£ ? -norms presented above 
and the corresponding proximal operator is therefore readily computed 
by applying first the proximal operator associated with the £i-norm 
(soft-thresholding) and the one associated with the ^i/^-norm (group 
soft-thresholding) . 



6 For a tree-structured S such an order exists. 



Prox^n = lJ gm o ... 



on, 
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Overlapping 4/foo-norms. When the groups overlap but do not 
have a tree structure, computing the proximal operator has proven to 
be more difficult, but it can still be done efficiently when q = oo. Indeed, 
as shown in [88], there exists a dual relation between such an operator 
and a quadratic min-cost flow problem on a particular graph, which 
can be tackled using network flow optimization techniques. Moreover, 
it may be extended to more general situations where structured sparsity 
is expressed through submodular functions |10j . 

Trace norm and spectral functions. The proximal operator for 
the trace norm, i.e., the unique minimizer of ^||A<f — iV|||i + /i||JVf||* 
for a fixed matrix M, may be obtained by computing a singular value 
decomposition of N and then replacing the singular values by their 
soft-thresholded versions [29]. This result can be easily extended to 
the case of spectral functions. Assume that the penalty $7 is of the 
form Q(M) = ip(s) where s is a vector carrying the singular val- 
ues of M and ip a convex function which is invariant by permu- 
tation of the variables (see, e.g., [20] ). Then, it can be shown that 
Prox^nfiV] = U Diag(Prox At ^,[s])V T , where U Diag(s)V T is a singu- 
lar value decomposition of N. 

3.4 Proximal Methods for Structured MKL 

In this section we show how proximal methods can be applied to solve 
multiple kernel learning problems. More precisely, we follow [96J who 
showed, in the context of plain MKL that proximal algorithms are 
applicable in a RKHS. We extend and present here this idea to the 
general case of structured MKL, showing that the proximal operator 
for the structured RKHS norm may be obtained from the proximal 
operator of the corresponding subquadratic norms. 

Given a collection of reproducing kernel Hilbert spaces 
we consider the Cartesian product 23 := "K\ x . . . x !K p , equipped with 
the norm ||/i||cb := (||^i + • • • + II^pIIik J) ) 1//2 > where h = (hi, . . . , h p ) 
with hi € "Hi. 

The set 23 is a Hilbert space, in which gradients and subgradients 
are well defined and in which we can extend some algorithms that we 
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considered in the Euclidean case easily. 

In the following, we will consider a monotonic norm as defined in 
Definition II. 11 This is motivated by the fact that a monotonic norm 
may be composed with norms of elements of Hilbert spaces to defines 
a norm on 23: 



Lemma 3.2. Let f2 be a monotonic norm on W with dual norm Q* , 
then A : ft H ^((l|^l[|j£i? • • • > [|^pl|jf p )) is a norm on 23 whose dual 
norm is A* : g i-4 O* ((||<7i H^, . . . , ||5p||jc p ))- Moreover the subgradient 
of A is dA(h) = A(h) with A(h) := {(u\Si, . . . , u p s p ) \ u £ B(h), S{ G 
d\\ • || % (fti)} with B(h) : = SS2((||fti|| Ml , . . . , WhpWxj). 



Proof. It is clear that A is symmetric, nonnegative definite and homo- 
geneous. The triangle inequality results from the fact that Q is mono- 
tonic. Indeed the latter property implies that A(ft + g) = fM(||ftj + 
gi\\Hi)i<i<p) < ^((HftilljCj + ||5ilUi)i<i<p) and the result follows from 
applying the triangle inequality for Q. 

Moreover, we have the generalized Cauchy-Schwarz inequality: 

{h,9h = ^{hi,gi)-K t < ll^ilUJbilki < H h ) A *(f), 

i i 

and it is easy to check that equality is attained if and only if g S A(h). 
This simultaneously shows that A(ft) = max fle s, A*(g)<l(^> (as a 
consequence of Proposition II. 4|) and that dA(h) = A{h) (by Proposi- 
tion H2D. □ 

We consider now a learning problem of the form: 

min f(h 1 ,...,h p ) + XA(h), (3.9) 

with, typically, following Section[L5l /(ft) = £ £"=i £(y^, h(x t )). The 
structured MKL case corresponds more specifically to the case where 
f(h) = i^ =1 -£(yW,fti(xi) + ... + ftp(xi)). Note that the problem 
we consider here is regularized with A and not A 2 as opposed to the 
formulations ()1.24|) and (|1.28p considered in Section 11.51 

To apply the proximal methods introduced in this chapter using 
|| • ||s as the proximal term requires one to be able to solve the proximal 
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problem: 

^hh-g\\l+f^A(h). (3.10) 

The following lemma shows that if we know how to compute the 
proximal operator of 0, for an I2 proximity term in R p , we can readily 
compute the proximal operator of A for the proximity defined by the 
Hilbert norm on S. Indeed, to obtain the image of h by the proximal 
operator associated with A, it suffices to apply the proximal operator 
of fi to the vector of norms (||fei||jCij ■ ■ ■ > H^plIlK „) which yields a vector 
(yi, . . . , y p ), and to scale in each space JCj, the function hi by /it Im- 
precisely: 

Lemma 3.3. Prox^sO = (visi, ■ ■ ■ , y P s p ) where = if gi = 0, 

Si = irni — if 5i/o and y = Prox M Q((||5 i ||j {i )i< i <p). 

\\9i\Wi 



Proof. To lighten the notations, we write \\hi\\ for ||/ii||j{j if hi G "Ki. 
The optimality condition for problem (I3,10p is h — g £ —fidA so that 
we have hi = gi — fiSiiii, with u £ B(h), Sj G <9|| • ||^(/ij). The last 
equation implies that hi,g{ and Si are colinear. If gi = then the fact 
that f2 is monotonic implies that hi = Si = 0. If on the other hand, 
gi 7^ we have hi = gi (1 — fmj)+ an d thus \\hi\\ = (\\gi\\ — fiUi) + and 
hi = Si \\hi\\, but by definition of yi we have yi = (\\gi\\ — iiUi) + , which 
shows the result. □ 

This results shows how to compute the proximal operator at an 
abstract level. For the algorithm to be practical, we need to show that 
the corresponding computation can be performed by manipulating a 
finite number of parameters. 

Fortunately, we can appeal to a representer theorem to that end, 
which leads to the following lemma: 



Lemma 3.4. Assume that for all i, gi = X]j=i a ijKi( x ji ")■ Then the 
solution of problem (|3.1U|) is of the form hi = Y^ = iflijKi(xj,-). Let 



52 Proximal Methods 



y = Ptox^q(( J acj K k ct k )i< k < p ) . Then if a; ^ 0, A = 7 fe Q » 
and otherwise /3j = 0. 



Proof. We first show that a representer theorem holds. For each i let M 



be the component of hi in the span of Ki(xj, -)i<j<n and hf = hi — hf 
We can rewrite the objective of problem (|3.1U|) aaij 



^E[ii^ii 2 +ii^ii 2 -2</*f,^+^^ 

i=l 

from which, given that f2 is assumed monotonic, it is clear that set- 
ting h± = for all i can only decrease the objective. To conclude, 
the form of the solution in j3 results from the fact that HgiH^. = 
T,i<j,j'<n a ij a ij'( K i( x jr),K i (x j/ ,-)}^ and (K^Xj,-),^^,-))^ = 
Ki(xj,Xj>) by the reproducing property, and by identification (note 
that if the kernel matrix Ki is not invertible the solution might not be 
unique in A). □ 

Finally, in the last lemma we assumed that the function in the 
proximal problem could be represented as a linear combination of the 
Ki(xj, •). Since gi is typically of the form h\ — xw^ffoi, • • ■ >^p)> then 
the result follows by linearity if the gradient is in the span of the 
K{(xj, •). The following lemma shows that this is indeed the case: 

Lemma 3.5. For f(h) = \ ££ =1 e(y^),h 1 (x j ), . . . , h p ( Xj )) then 
d n 1 

Qh.f{ti) = 1 ^2a ij K i (xj,-) for a y = -di£(y^>\ hi(xj), . . . , h p (xj)), 
1 i=i 

where dil denote the partial derivative of I w.r.t. to its i + 1-th scalar 
component. 

Proof. This result follows from the rules of composition of differentia- 
tion applied to the functions 

(hi,..., hp) h- £(y { °\ (h^K^Xj,-))^, (h p ,K p (xj,-)) Kp ), 



7 We denote again \\hi\\ for \\hi\\ ^. , when the RHKS norm used is implied by the argument. 
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and the fact that, since hi i-> (hi,Ki(xj, ■))'k 1 is linear, its gradient in 
the RKHS JC* is just K^Xj, ■). 

□ 



4 



(Block) Coordinate Descent Algorithms 



Coordinate descent algorithms solving ^-regularized learning problems 
go back to [52]. They optimize (exactly or approximately) the objective 
function with respect to one variable at a time while all others are kept 
fixed. Note that, in general, coordinate descent algorithms are not nec- 
essarily convergent for non-smooth optimization (cf. [18J, p. 636); they 
are however applicable in this setting because of a separability property 
of the nonsmooth regularizer we consider (see end of Section 14. ip . 

4.1 Coordinate Descent for li-Regularization 

We consider first the following special case of the one-dimensional i\- 
regularized problem: 



As shown in (|l,5p . the minimizer w* can be obtained by soft- 
thresholding: 






(4.2) 
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Lasso case. In the case of the square loss, the minimization with 
respect to a single coordinate can be written as 

mm Vj/(ii>*) {Wj - «?*•) + -V^- /(w^Wj - w)) 2 + X\ Wj \, 

with Vjf(w) = \X]{Xw - y) and V] j /(w) = ^XjXj independent 
of w. Since the above equation is of the form (|4.ip . it can be solved in 
closed form: 

«J = Prox, H ( W j - V,/^:)^./). (4.3) 

In other words, Wj is obtained by solving the unregularized problem 
with respect to coordinate j and soft-thresholding the solution. 

This is the update proposed in the shooting algorithm of Fu [52], 
which cycles through all variables in a fixed order Q Other cycling 
schemes are possible (see, e.g., |104| ). 

An efficient implementation is obtained if the quantity Xw 1 — y or 
even better V f{w t ) = ^X T Xw l — ^X T y is kept updated^ 

Smooth loss. For more general smooth losses, like the logistic loss, 
the optimization with respect to a single variable cannot be solved in 
closed form. It is possible to solve it numerically using a sequence of 
modified Newton steps as proposed in [128J. We present here a fast 
algorithm of Tseng and Yun |141| based on solving just a quadratic 
approximation of / with an inexact line search at each iteration. 
Let L* > be a parameter and let Wj be the solution of 

min Vjf(w t ) (wj - wf) + \-L l (wj - w*) 2 + X\wj\, 

Given d = Wj — w l - where w*- is the solution of (|4.3|) . the algorithm of 
Tseng and Yun performs a line search to choose the largest step of the 

1 Coordinate descent with a cyclic order is sometimes called a Gauss-Seidel procedure. 

2 In the former case, at each iteration, Xw — y can be updated in 0(n) operations if wj 
changes and V^t+i f(w) can always be updated in ©(n) operations. The complexity of 
one cycle is therefore 0(pn). However a better complexity is obtained in the latter case, 
provided the matrix X T X is precomputed (with complexity 0(p 2 n)). Indeed V/(io 4 ) is 
updated in ©(p) iterations only if Wj does not stay at 0. Otherwise, if wj stays at the 
step costs O(l); the complexity of one cycle is therefore Q(ps) where s is the number of 
non-zero variables at the end of the cycle. 
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form ad with a = a^fi k and ao > 0,/3 G (0, 1), A; G N, such that the 
following modified Armijo condition is satisfied: 

F(w* + adej) - Fiw 1 ) < aa(V j f(w)d + jL t d 2 + |to*- + d| - |io*-|), 

where F(w) := /(tw)+A Sl(xy), and < 7 < 1 and cr < 1 are parameters 
of the algorithm. 

Tseng and Yun |141j show that under mild conditions on / the 
algorithm is convergent and, under further assumptions, asymptoti- 
cally linear. In particular, if / is of the form ^ Ya=i ^-{v 1 wT x^) 
with i(y^',-) a twice continuously differentiable convex function with 
strictly positive curvature, the algorithm is asymptotically linear for 
L* = V|/(ioj). We refer the reader to Section 031 and to [14111150] for 
results under much milder conditions. It should be noted that the algo- 
rithm generalizes to other separable regularizations than the ^i-norm. 

Variants of coordinate descent algorithms have also been considered 
in \54\ [771 1152] . Generalizations based on the Gauss-Southwell rule have 
been considered in [141] . 

Convergence of coordinate descent algorithms. In general, co- 
ordinate descent algorithms are not convergent for non-smooth ob- 
jectives. Therefore, using such schemes always requires a convergence 
analysis. In the context of the £i-norm regularized smooth objective, 
the non-differentiability is separable (i.e., is a sum of non-differentiable 
terms that depend on single variables), and this is sufficient for con- 
vergence [181 1141] . In terms of convergence rates, coordinate descent 
behaves in a similar way to first-order methods such as proximal meth- 
ods, i.e., if the objective function is strongly convex [10411141] . then the 
convergence is linear, while it is slower if the problem is not strongly 
convex, e.g., in the learning context, if there are strong correlations 
between input variables [125] , 

4.2 Block-Coordinate Descent for ^/^-Regularization 

When O(to) is the £i/£ g -norm with groups g G 9 forming a partition 
of {1, . . . , p}, the previous methods are generalized by block-coordinate 
descent (BCD) algorithms, which have been the focus of recent work by 
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Tseng and Yun |141j and Wright [150J. These algorithms do not attempt 
to solve exactly a reduced problem on a block of coordinates but rather 
optimize a surrogate of F in which the function / is substituted by a 
quadratic approximation. 

Specifically, the BCD scheme of [141] solves at each iteration a prob- 
lem of the form: 

min V gfiw 1 ) 1 {wg-w^ + hwg-w^ 1 'ff*(uj 5 -u>*)+A||u\,|| 9 , (4.4) 

where the positive semi-definite matrix H l G fljlslxlsl is a parameter. 
Note that this may correspond to a richer quadratic approximation 
around w g than the proximal terms used in Section [3J However, in 
practice, the above problem is solved in closed form if H l = LrL g \ for 
some scalar L* and q E {2, oo]H. In particular for q = 2, the solution 
w* is obtained by group-soft-thresholding: 

w* = Proxj, |H|a {w g - —VgffaD) , 

with 

Prox HH | 2 (H=(l-|^) + ™. 

In the case of general smooth losses, the descent direction is given 
by d = w* — w g with w* as above. The next point is of the form 
w t + ad, where a is a stepsize of the form a = «o P k , with «o > 0, 
0</3<l,A;GN. The parameter k is chosen large enough to satisfy 
the following modified Armijo condition 

F(w t +ad)-F(w t ) < aa(Vgf(w) T d+-fd T H t d+\\w t g + d\\ q -\\w t g \\ q ), 

for parameters < 7 < 1 and a < 1. 

If / is convex continuously differentiable, lower bounded on MP and 
F has a unique minimizer, provided that there exists r, f fixed constants 
such that for all t, r X H l H f , the results of Tseng and Yun show 
that the algorithm converges (see Theorem 4.1 in [141] for broader 
conditions). Wright [150] proposes a variant of the algorithm, in which 
the line-search on a is replaced by a line search on the parameter L*, 
similar to the line-search strategies used in proximal methods. 

3 More generally for q > 1 and H l = Ltl\ g \ , it can be solved efficiently coordinate- wise 
using bisection algorithms. 
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4.3 Block-coordinate descent for MKL 

Finally, block-coordinate descent algorithms are also applicable to clas- 
sical multiple kernel learning . We consider the same setting and no- 
tation as in Section 13.41 and we consider specifically the optimization 
problem: 



A block-coordinate algorithm can be applied by considering each RKHS 
'Ki as one "block"; this type of algorithm was considered in |114j . Ap- 
plying the lemmas 13,41 and 13.51 of Section 13. 4} we know that hi can be 
represented as hi = X^j=i a ijKi( x j> ')■ 

The algorithm then consists in performing successively group soft- 
thresholding in each RKHS "Hi. This can be done by working directly 
with the dual parameters ctj, with a corresponding proximal operator 
in the dual simply formulated as: 



with ||o:||/t = a T Kot. The precise equations would be obtained by 
kernelizing Eq. 14.41 (which requires kernelizing the computation of the 
gradient and the Hessian as in Lemma [33]). We leave the details to the 
reader. 



p 




8=1 
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Reweighted-^2 Algorithms 



Approximating a nonsmooth or constrained optimization problem by 
a series of smooth unconstrained problems is common in optimiza- 
tion (see, e.g., [251 11021 1105] ). In the context of objective functions 
regularized by sparsity- inducing norms, it is natural to consider varia- 
tional formulations of these norms in terms of squared ^-norms, since 
many efficient methods are available to solve ^-regularized problems 
(e.g., linear system solvers for least-squares regression). 

5.1 Variational formulations for grouped Ix-norms 

In this section, we show on our motivating example of sums of ^-norms 
of subsets how such formulations arise (see, e.g., [56 1 E I I7 H 111011111] ). 
The variational formulation we have presented in Section 11.4.21 allows 
us to consider the following function H(w,r]) defined as 

3=1 L g&5,j&g } geS 

It is jointly convex in (w,rj); the minimization with respect to rj can 
be done in closed form, and the optimum is equal to F(w) = f(w) + 

59 
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XO,(w); as for the minimization with respect to w, it is an ^-regularized 
problem. 

Unfortunately, the alternating minimization algorithm that is im- 
mediately suggested is not convergent in general, because the func- 
tion H is not continuous (in particular around r\ which has zero coor- 
dinates). In order to make the algorithm convergent, two strategies are 
commonly used: 

— Smoothing: we can add a term of the form | ]C 9 eS ^s^' wrnc h 
yields a joint cost function with compact level sets on the set of positive 
numbers. Alternating minimization algorithms are then convergent (as 
a consequence of general results on block coordinate descent), and have 
two different iterations: (1) minimization with respect to rj in closed 
form, through r] g = y/\\iVg\\2 + £, and (2) minimization with respect 
to w, which is an ^-regularized problem, which can be for example 
solved in closed form for the square loss. Note however, that the second 
problem does not need to be exactly optimized at all iterations. 

— First order method in r/: while the joint cost function 
H(r],w) is not continuous, the function I(r]) = mm w ^ P H(w, 77) is 
continuous, and under general assumptions, continuously differen- 
tiable, and is thus amenable to first-order methods (e.g., proximal 
methods, gradient descent). When the groups in 9 do not overlap, 
one sufficient condition is that the function f{w) is of the form 
f(w) = ip{Xw) for X G M nxp any matrix (typically the design 
matrix) and ip a strongly convex function on W 1 . This strategy is 
particularly interesting when evaluating I{rj) is computationally cheap. 

In theory, the alternating scheme consisting of optimizing alter- 
nately over 77 and w can be used to solve learning problems regu- 
larized with any norms: on top of the subquadratic norms defined in 
Section 11.4.21 we indeed show in the next section that any norm ad- 
mits a quadratic variational formulation (potentially defined from a 
non-diagonal symmetric positive matrix). To illustrate the principle of 
^2-reweighted algorithms, we first consider the special case of multiple 
kernel learning; in Section 15. 21 we consider the case of the trace norm. 
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Structured MKL. Reweighted-^2 algorithms are fairly natural for 
norms which admit a diagonal variational formulation (see Lemma ll. 81 
and [M]) and for the corresponding multiple kernel learning problem. 
We consider the structured multiple learning problem presented in Sec- 
tion ELTlZI 

The alternating scheme applied to Eq. (11.270 then takes the follow- 
ing form: for rj fixed, one has to solve a single kernel learning prob- 
lem with the kernel K = X^^i-^ij the corresponding solution in the 
product of RKHSs "K\ x . . . x IK P (see Section I3.4p is of the form 
h(x) = h\{x) + . . . + h p (x) with hi(x) = rji Y^j=i OijKi(xj, •)• Since 
II^hIIjc- = Vi ^ Ki a i f° r fixed ex., the update in rj then takes the form: 



Note that these updates produce a non-increasing sequence of values of 
the primal objective. Moreover, this MKL optimization scheme uses a 
potentially much more compact parameterization than proximal meth- 
ods since in addition to the variational parameter rj £M P & single vector 
of parameters cx 6 M n is needed as opposed to up to one such vector 
for each kernel in the case of proximal methods. MKL problems can 
also be tackled using first order methods in r] described above: we refer 
the reader to |lllj for an example in the case of classical MKL. 

5.2 Quadratic Variational Formulation for General Norms 

We now investigate a general variational formulation of norms that nat- 
urally leads to a sequence of reweighted ^-regularized problems. The 
formulation is based on approximating the unit ball of a norm f2 with 
enclosing ellipsoids. See Figure 15. 1[ The following proposition shows 
that all norms may be expressed as a minimum of Euclidean norms: 



Proposition 5.1. Let f2 : MP — > M. a norm on MP, then there exists a 
function g defined on the cone of positive semi-definite matrices Sp , 
such that g is convex, strictly positive except at zero, positively homo- 
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Fig. 5.1: Example of a sparsity-inducing ball in two dimensions, with 
enclosing ellipsoids. Left: ellipsoids with general axis for the ^i-norm; 
middle: ellipsoids with horizontal and vertical axis for the ^i-norm; 
right: ellipsoids for another polyhedral norm. 

geneous and such that for all w £ W, 

Q(w) = min V w T A~ 1 w = - min {w T A~ 1 w+g(A)\ . (5.1) 
Aes+, s(A)<i 2 Aes+ 



Proof. Let f2* be the dual norm of f2, defined as = 
max !](ic)^i l " Ts 2"o|. Let g be the function defined through g(A) = 
max fl*(s)ci s T As. This function is well-defined as the maximum of 
a continuous function over a compact set; moreover, as a maximum 
of linear functions, it is convex and positive homogeneous. Also, for 
nonzero A, the quadratic form s i-> s T As is not identically zero around 
s = 0, hence the strict positivity of g. 

Let w £ MP and A 6 ; there exists s such that fi*(s) = 1 and 
w T s = Q(w). We then have 

fl{w) 2 = (w T s) 2 ^ {w T A- 1 w){s T As) ^ g(A)(w T A- 1 w). 

This shows that Q(w) sj min AgS + 9 (a)<i v 7 w j A~ l w. Proving the 
other direction can be done using the following limiting argument. 
Given wq G K p , consider A(e) = (1 — £)wqWq + e(wqWq)I. We 
have WqA(e)~ 1 wo = 1 and g(A(e)) — >■ g(woWQ) = Q(wo) 2 . Thus, for 

A(e) = A(e)/g(A(e)), we have that J Wq A(e)^ 1 wq tends to O(iwq), 
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thus O(iwo) must be no smaller than the minimum over all A. The 
right-hand side of Eq. (|5.ip can be obtained by optimizing over the 
scale of A. □ 

Note that while the proof provides a closed-form expression for a 
candidate function g, it is not unique, as can be seen in the following 
examples. The domain of g (matrices so that g is finite) may be reduced 
(in particular to diagonal matrices for the ^i-norm and more generally 
the sub-quadratic norms defined in Section fl.4.2p : 

— For the £i-norm: using the candidate from the proof, we have 
g(A) = max|| s || oo<1 s T As, but we could use g(A) = Tr A if A is diago- 
nal and +00 otherwise. 

— For subquadratic norms (Section ll.4.2|) . we can take g(A) to be 
+00 for non-diagonal A, and either equal to the gauge function of the 
set H, i.e. the function s 1— > inf{f G R + | s G isH}, or equal to the 
function Q defined in Lemma ll.l| both applied to the diagonal of A. 

— For the ^2-norm: 5(A) = A max (A) but we could of course use 
g(A) = 1 if A = I and +00 otherwise. 

— For the trace norm: w is assumed to be of the form w = vect(W) 
and the trace norm of W is regularized. The trace norm admits the 
variational form (see [6]) : 

IIWIL = - inf tr(W T D- 1 W + D) s.t. D y 0. (5.2) 

2 .DhO 

But tr (W J D~ l W) = w T (I (g) D) ^w, which shows that the reg- 
ularization by the trace norm takes the form of Eq. (|5.1|) in which we 
can choose g(A) equal to tr(D) if A = J® D for some D y and +00 
otherwise. 

The solution of the above optimization problem is given by D* = 
(WW T ) 1 / 2 which can be computed via a singular value decomposition 
of W. The reweighted-^2 algorithm to solve 

min f(W) + X\\W\L 

w&RP xk 

therefore consists of iterating between the two updates (see, e.g., [B] for 
more details): 

W <- argmin/(W) + - Tr(W J D l W) and 
w 2 
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D <- (WW T +el k ) 1/2 , 

where e is a smoothing parameter that arises from adding a term 
^ Tr(Z? _1 ) to Eq. (j5.2H and prevents the matrix from becoming sin- 
gular. 



6 



Working-Set and Homotopy Methods 



In this section, we consider methods that explicitly take into account 
the fact that the solutions are sparse, namely working set methods and 
homotopy methods. 

6.1 Working-Set Techniques 

Working-set algorithms address optimization problems by solving an 
increasing sequence of small subproblems of (jl.ip . which we recall can 
be written as 

min f(w) + \n(w), 

where / is a convex smooth function and Q a sparsity-inducing norm. 
The working set, which we denote by J, refers to the subset of variables 
involved in the optimization of these subproblems. 

Working-set algorithms proceed as follows: after computing a so- 
lution to the problem restricted to the variables in J (while setting 
the variables in J c to zero), global optimality is checked to determine 
whether the algorithm has to continue. If this is the case, new variables 
enter the working set J according to a strategy that has to be defined. 
Note that we only consider forward algorithms, i.e., where the working 
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set grows monotonically. In other words, there are no backward steps 
where variables would be allowed to leave the set J. Provided this as- 
sumption is met, it is easy to see that these procedures stop in a finite 
number of iterations. 

This class of algorithms is typically applied to linear programming 
and quadratic programming problems (see, e.g., [105]), and here takes 
specific advantage of sparsity from a computational point of view 

EOl HUH EZl H2H , since the subproblems that need to be solved 
are typically much smaller than the original one. 

Working-set algorithms require three ingredients: 

• Inner-loop solver: At each iteration of the working-set algo- 
rithm, problem (jl.ip has to be solved on J, i.e., subject to the addi- 
tional equality constraint that Wj = for all j in J c : 



The computation can be performed by any of the methods presented 
in this monograph. Working-set algorithms should therefore be viewed 
as "meta-algorithms" . Since solutions for successive working sets are 
typically close to each other the approach is efficient if the method 
chosen can use warm-restarts. 

• Computing the optimality conditions: Given a solution w* 
of problem (I6.ip . it is then necessary to check whether w* is also a 
solution for the original problem (jl.ip . This test relies on the duality 
gaps of problems (|6.ip and In particular, if w* is a solution of 

problem (|6.ip . it follows from Proposition 11.61 in Section [1.41 that 



In fact, the Lagrangian parameter associated with the equality con- 
straint ensures the feasibility of the dual variable formed from the 
gradient of / at w*. In turn, this guarantees that the duality gap of 
problem ()6.ip vanishes. The candidate w* is now a solution of the full 
problem (jl.ip . i.e., without the equality constraint wjc = 0, if and only 



w&RP 



min f(w) + Afi(xo), such that wjc = 0. 



(6.1) 



/(o + ao(o + /* (v/«» = o. 



if 



n*(v/(«>*)) < a. 



(6.2) 
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Condition (16, 2p points out that the dual norm f2* is a key quantity 
to monitor the progress of the working-set algorithm |67] , In simple 
settings, for instance when f2 is the £i-norm, checking condition (|6,2I) 
can be easily computed since Q* is just the ^-norm. In this case, 
condition (|6.2p becomes 

|[V/(u>*)] 3 -| < A, for all jm{l,...,p}. 

Note that by using the optimality of problem (16. If) . the components of 
the gradient of / indexed by J are already guaranteed to be no greater 
than A. 

• Strategy for the growth of the working set: If condi- 
tion (]6.2p is not satisfied for the current working set J, some inactive 
variables in J c have to become active. This point raises the questions of 
how many and how these variables should be chosen. First, depending 
on the structure of fi, a single or a group of inactive variables have 
to be considered to enter the working set. Furthermore, one natural 
way to proceed is to look at the variables that violate condition (16. 2p 
most. In the example of ^-regularized least squares regression with 
normalized predictors, this strategy amounts to selecting the inactive 
variable that has the highest correlation with the current residual. 

The working-set algorithms we have described so far aim at solv- 
ing problem (jl.ip for a fixed value of the regularization parameter A. 
However, for specific types of loss and regularization functions, the set 
of solutions of problem (II. ID can be obtained efficiently for all possible 
values of A, which is the topic of the next section. 

6.2 Homotopy Methods 

We present in this section an active-se10 method for solving the Lasso 
problem of Eq. (jl.8p . We recall the Lasso formulation: 

min — \\y — Xw\\% + Allwlli, (6.3) 
weRp 2n 



Active-set and working-set methods are very similar; they differ in that active-set methods 
allow (or sometimes require) variables returning to zero to exit the set. 
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0.1 0.2 0.3 0.4 0.5 0.6 
Regularization parameter 



Fig. 6.1: The weights w*(X) are represented as functions of the regu- 
larization parameter A. When A increases, more and more coefficients 
are set to zero. These functions are all piecewise affine. Note that some 
variables (here one) may enter and leave the regularization path. 



where y is in M. n , and X is a design matrix in M nxp . Even though 
generic working-set methods introduced above could be used to solve 
this formulation (see, e.g., [80j), a specific property of the ^i-norm 
associated with a quadratic loss makes it possible to address it more 
efficiently. 

Under mild assumptions (which we will detail later), the solution of 
Eq. (16, 3D is unique, and we denote it by w*(X) for a given regularization 
parameter A > 0. We use the name regularization path to denote the 
function A i— > w*(X) that associates to a regularization parameter A 
the corresponding solution. We will show that this function is piecewise 
linear, a behavior illustrated in Figure IB~T1 where the entries of w*(X) 
for a particular instance of the Lasso are represented as functions of A. 

An efficient algorithm can thus be constructed by choosing a partic- 
ular value of A, for which finding this solution is trivial, and by follow- 
ing the piecewise affine path, computing the directions of the current 
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affine parts, and the points where the direction changes (also known 
as kinks). This piecewise linearity was first discovered and exploited in 
|91| in the context of portfolio selection, revisited in |1U9| describing 
a homotopy algorithm, and studied in [35] with the LARS algorithmic 
Similar ideas also appear early in the optimization literature: Finding 
the full regularization path of the Lasso is in fact a particular instance 
of a parametric quadratic programming problem, for which path follow- 
ing algorithms have been developed [115]. 

Let us show how to construct the path. From the optimality con- 
ditions we have presented in Eq. (II. 9p . denoting by J := {j; \Xj (y — 
Xw*)\ = nX} the set of active variables, and defining the vector t in 
{— 1; 0; 1} P as t := sign(X T (j/ — It»*)), we have the following closed- 
form expression 

w}(\) =(XjXj)- l (Xjy-n\tj) 
w* JC (X) =0, 

where we have assumed the matrix Xj Xj to be invertible (which is 
a sufficient condition to guarantee the uniqueness of w*). This is an 
important point: if one knows in advance the set J and the signs t j, 
then w*(X) admits a simple closed-form. Moreover, when J and tj 
are fixed, the function A i— > (Xj X J )~ 1 (Xjy — nXtj) is affine in A. 
With this observation in hand, we can now present the main steps of 
the path-following algorithm. It basically starts from a trivial solution 
of the regularization path, follows the path by exploiting this formula, 
updating J and tj whenever needed so that optimality conditions (jl.9|) 
remain satisfied. This procedure requires some assumptions — namely 
that (a) the matrix Xj Xj is always invertible, and (b) that updating 
J along the path consists of adding or removing from this set a single 
variable at the same time. Concretely, we proceed as follows: 

(1) Set A to - \\X T y\\ 00 for which it is easy to show from Eq. (|1.9|) 
that u>*(A) = (trivial solution on the regularization path). 

(2) Set J := {j; \Xjy\ = nX}. 



2 Even though the basic version of LARS is a bit different from the procedure we have 
just described, it is closely related, and indeed a simple modification makes it possible to 
obtain the full regularization path of Eq. I jf .81 . 
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(3) Follow the regularization path by decreasing the value of A, 
with the formula Wj(\) = (Xj X J )~ 1 (Xj y — n\tj) keeping 
Wj c = 0, until one of the following events (kink) occurs 

• There exists j in J c such that \Xj (y — Xw*)\ = nX. 
Then, add j to the set J. 

• There exists j in J such that a non-zero coefficient 
Wj hits zero. Then, remove j from J. 

We assume that only one of such events can occur at the 
same time (b). It is also easy to show that the value of A 
corresponding to the next event can be obtained in closed 
form such that one can "jump" from a kink to another. 

(4) Go back to [3 

Let us now briefly discuss assumptions (a) and (b). When the ma- 
trix XjXj is not invertible, the regularization path is non-unique, 
and the algorithm fails. This can easily be fixed by addressing instead 
a slightly modified formulation. It is possible to consider instead the 
elastic- net formulation [159j that uses Q(w) = A||ti>[|i + ?[|«j[||. Indeed, 
it amounts to replacing the matrix Xj Xj by Xj Xj + wyl, which is 
positive definite and therefore always invertible, and to apply the same 
algorithm. The second assumption (b) can be unsatisfied in practice 
because of the machine precision. To the best of our knowledge, the 
algorithm will fail in such cases, but we consider this scenario unlikely 
with real data, though possible when the Lasso/basis pursuit is used 
multiple times such as in dictionary learning, presented in Section [7.31 
In such situations, a proper use of optimality conditions can detect 
such problems and more stable algorithms such as proximal methods 
may then be used. 

The complexity of the above procedure depends on the number of 
kinks of the regularization path (which is also the number of iterations 
of the algorithm). Even though it is possible to build examples where 
this number is large, we often observe in practice that the event where 
one variable gets out of the active set is rare. The complexity also 
depends on the implementation. By maintaining the computations of 
Xj (y — Xw*) and a Cholesky decomposition of (Xj _Xj) _1 , it is possi- 
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ble to obtain an implementation in 0(psn+ps 2 + s 3 ) operations, where 
s is the sparsity of the solution when the algorithm is stopped (which 
we approximately consider as equal to the number of iterations). The 
product psn corresponds to the computation of the matrices Xj Xj, 
ps 2 to the updates of the correlations Xj (y — Xw*) along the path, 
and s 3 to the Cholesky decomposition. 



7 



Sparsity and Nonconvex Optimization 



In this section, we consider alternative approaches to sparse modelling, 
which are not based in convex optimization, but often use convex op- 
timization problems in inner loops. 

7.1 Greedy Algorithms 

We consider the ^-constrained signal decomposition problem 

min — III/ — JSCit? ||o s.t. [I wlln < s, (7.1) 

weRp 2n 

where s is the desired number of non-zero coefficients of the solution, 
and we assume for simplicity that the columns of X have unit norm. 
Even though this problem can be shown to be NP-hard |97j . greedy 
procedures can provide an approximate solution. Under some assump- 
tions on the matrix X, they can also be shown to have some optimality 
guarantees [137] . 

Several variants of these algorithms with different names have been 
developed both by the statistics and signal processing communities. In 
a nutshell, they are known as forward selection techniques in statis- 
tics (see |146j ). and matching pursuit algorithms in signal process- 
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ing |90| . All of these approaches start with a null vector w, and it- 
eratively add non-zero variables to w until the threshold s is reached. 

The algorithm dubbed matching pursuit, was introduced in the 90's 
in [90], and can be seen as a non-cyclic coordinate descent procedure 
for minimizing Eq. (|7.ip . It selects at each step the column x 1 that is 
the most correlated with the residual according to the formula 

% <— argmm \r x |, 

ie{i,..., P } 

where r denotes the residual y — Xw. Then, the residual is projected 
on x l and the entry is updated according to 

Wi <— Wi + r 1 x % 
r 4- r — (r T x l )x l . 

Such a simple procedure is guaranteed to decrease the objective func- 
tion at each iteration, but is not to converge in a finite number of steps 
(the same variable can be selected several times during the process). 
Note that such a scheme also appears in statistics in boosting proce- 
dures [5T] . 

Orthogonal matching pursuit [90| was proposed as a major variant 
of matching pursuit that ensures the residual of the decomposition to 
be always orthogonal to all previously selected columns of X. Such tech- 
nique existed in fact in the statistics literature under the name forward 
selection [145], and a particular implementation exploiting a QR ma- 
trix factorization also appears in [97]. More precisely, the algorithm is 
an active set procedure, which sequentially adds one variable at a time 
to the active set, which we denote by J. It provides an approximate 
solution of Eq. (|7.ip for every value s' < s, and stops when the de- 
sired level of sparsity is reached. Thus, it builds a regularization path, 
and shares many similarities with the homotopy algorithm for solving 
the Lasso [45] . even though the two algorithms address different opti- 
mization problems. These similarities are also very strong in terms of 
implementation: Identical tricks as those described in Section 16.21 for 
the homotopy algorithm can be used, and in fact both algorithms have 
roughly the same complexity (if most variables do not leave the path 
once they have entered it). At each iteration, one has to choose which 
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new predictor should enter the active set J. A possible choice is to 
look for the column of X most correlated with the residual as in the 
matching pursuit algorithm, but another criterion is to select the one 
that helps most reducing the objective function 

1 .. 

i *r- argmm mm —\\y - X Ju i i} w 2 . 

Whereas this choice seem at first sight computationally expensive since 
it requires solving | J c \ least-squares problems, the solution can in fact 
be obtained efficiently using a Cholesky matrix decomposition of the 
matrix Xj Xj and basic linear algebra, which we will not detail here 
for simplicity reasons (see [40J for more details). 

After this step, the active set is updated J <- JU {«}, and the 
corresponding residual r and coefficients w are 

w <- (XjXjy'xjy, 

r^(I-Xj(XjX J )- 1 X])y, 

where r is the residual of the orthogonal projection of y onto the linear 
subspace spanned by the columns of Xj. It is worth noticing that one 
does not need to compute these two quantities in practice, but only 
updating the Cholesky decomposition of (Xj Xj)~ 1 and computing 
directly X T r, via simple linear algebra relations. 

For simplicity, we have chosen to present matching pursuit algo- 
rithms for solving the £o- s P arse approximation problem, but they admit 
several variants (see [99 J for example), or extensions when the regular- 
ization is more complex than the £o-P enar ty or for other loss functions 
than the square loss. For instance, they are used in the context of 
non-convex group-sparsity in |138j . or with structured sparsity formu- 
lations [ig Eg. 

We also remark that other possibilities than greedy methods exists 
for optimizing Eq. (17. ip . One can notably use the algorithm ISTA (i.e., 
the non-accelerated proximal method) presented in Section [3] when the 
function / is convex and its gradient Lipschitz continuous. Under this 
assumption, it is easy to see that ISTA can iteratively decrease the value 
of the nonconvex objective function. Such proximal gradient algorithms 
when f2 is the lo-penalty often appear under the name of iterative hard- 
thresholding methods [60J. 
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7.2 Reweighted-£i Algorithms with DC-Programming 

We focus in this section on optimization schemes for a certain type of 
non-convex regularization functions. More precisely, we consider prob- 
lem (jl.ip when f2 is a nonconvex separable penalty that can be written 
as n(w) := X)f=i C{\ w i\)i where w is in R p , and ( : M + — > M + is a con- 
cave non-decreasing differentiable function. In other words, we address 

v 

min f(w) + A VC(KI), (7.2) 

weRP ^-^ 
i=l 

where / is a convex smooth function. Examples of such penalties in- 
clude variants of the ^-penalties for q < 1 defined as C '■ t l— (\t\ + £) q -, 
log-penalties ( : i ^ log(|t| + e), where e > makes the function ( 
differentiable at 0. Other nonconvex regularization functions have been 
proposed in the statistics community, such as the SCAD penalty [48J . 

The main motivation for using such penalties is that they induce 
more sparsity than the ^i-norm, while they can be optimized with con- 
tinuous optimization as opposed to greedy methods. The unit balls 
corresponding to the £ q pseudo-norms and norms are illustrated in 
Figure 17.11 for several values of q. When q decreases, the ^-ball ap- 
proaches in a sense the ^o-ball, which allows to induce sparsity more 
aggressively. 




(a) ^o-ball. (b) £ .5-ball. (c) ^i-ball. (d) ^ 2 -ball. 



Fig. 7.1: Unit balls in 2D corresponding to ^-penalties. 

Even though the optimization problem (|7.2p is not convex and not 
smooth, it is possible to iteratively decrease the value of the objective 
function by solving a sequence of convex problems. Algorithmic schemes 
of this form appear early in the optimization literature in a more gen- 
eral framework for minimizing the difference of two convex functions 
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(or equivalently the sum of a convex and a concave function), which is 
called DC programming (see [53J and references therein). Even though 
the objective function of Equation (|7.2|) is not exactly a difference of 
convex functions (it is only the case on R+), the core idea of DC pro- 
gramming can be applied. We note that these algorithms were recently 
revisited under the particular form of reweighted-£i algorithms [30]. 
The idea is relatively simple. We denote by g : MP — > R the objective 
function which can be written as g(w) := f(w) + A^^ =1 £(|u7j|) for a 
vector w in R p . This optimization scheme consists in minimizing, at 
iteration k of the algorithm, a convex upper bound of the objective 
function g, which is tangent to the graph of g at the current esti- 
mate w k . 

A surrogate function with these properties is obtained easily by 
exploiting the concavity of the functions Q on R + , which always lie 
below their tangents, as illustrated in Figure TL2[ The iterative scheme 
can then be written 

p 

w k+1 ^ argmin / (w) + A YY ( K* | ) K | , 

which is a reweighted-^i sparse decomposition problem [30] . To ini- 
tialize the algorithm, the first step is usually a simple Lasso, with no 
weights. In practice, the effect of the weights C'(\ w i I) is to push to zero 
the smallest non-zero coefficients from iteration k — 1, and two or three 
iterations are usually enough to obtain the desired sparsifying effect. 
Linearizing iteratively concave functions to obtain convex surrogates is 
the main idea of DC programming, which readily applies here to the 
functions w \-t ((\w\). 

For simplicity we have presented these reweighted-^i algorithms 
when is separable. We note however that these optimization schemes 
are far more general and readily apply to non-convex versions of most 
of the penalties we have considered in this paper. For example, when 
the penalty has the form 

n(«) = £C(KII), 

963 

where £ is concave and differentiable on M + , S is a set of (possibly over- 
lapping) groups of variables and ||.|| is any norm. The idea is then simi- 
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(a) red: C(M) = MM + e). (b) red: f(w) + C(\w\). 

blue: convex surrogate f'(|ui fc |)|«)| + C . blue: f(w) + £'(|?A> fe |)|u>| + C . 



Fig. 7.2: Functions and their surrogates involved in the reweighted-^i 
optimization scheme in the one dimensional case (p = 1). The func- 
tion £ can be written here as CGH) = log(| w| + e) for a scalar w in R, 
and the function / is quadratic. The graph of the non-convex functions 
are represented in red and their convex "tangent" surrogates in blue. 

lar, iteratively linearizing for each group g the functions £ around \\w g \\ 
and minimizing the resulting convex surrogate (see an application to 
structured sparse principal component analysis in |71j). 

Finally, another possible approach to solve optimization problems 
regularized by nonconvex penalties of the type presented in this section 
is to use the reweighted-^2 algorithms of Section based on quadratic 
variational forms of such penalties (see, e.g., [7T]). 

7.3 Sparse Matrix Factorization and Dictionary Learning 

Sparse linear models for regression in statistics and machine learning 
assume a linear relation y ~ Xw, where y in R n is a vector of obser- 
vations, X in R nxp is a design matrix whose rows can be interpreted 
as features, and w is a weight vector in W . Similar models are used 
in the signal processing literature, where y is a signal approximated 
by a linear combination of columns of X , which are called dictionary 
elements, or basis element when X is orthogonal. 

Whereas a lot of attention has been devoted to cases where X is 
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fixed and pre-defined, another line of work considered the problem of 
learning X from training data. In the context of sparse linear models, 
this problem was first introduced in the neuroscience community by 
Olshausen and Field [108J to model the spatial receptive fields of simple 
cells in the mammalian visual cortex. Concretely, given a training set 
of q signals Y = [y 1 , . . . , y q ] in R" X|J , one would like to find a dictionary 
matrix X in W ixp and a coefficient matrix W = [w 1 , . . . , w q ] in MP xq 
such that each signal y % admits a sparse approximation Xw r . In other 
words, we want to learn a dictionary X and a sparse matrix W such 
that Y « XW. 

A natural formulation is the following non-convex matrix factoriza- 
tion problem: 



where is a sparsity-inducing penalty function, and X C W nxp is a 
convex set, which is typically the set of matrices whose columns have 
£2-norm less or equal to one. Without any sparse prior (i.e., for A = 0), 
the solution of this factorization problem is obtained through princi- 
pal component analysis (PC A) (see, e.g., [28] and references therein). 
However, when A > 0, the solution of Eq. (|7.3p has a different behavior, 
and may be used as an alternative to PCA for unsupervised learning. 

A successful application of this approach is when the vectors y % are 
small natural image patches, for example of size n = 10 x 10 pixels. 
A typical setting is to have an overcomplete dictionary — that is, the 
number of dictionary elements can be greater than the signal dimension 
but small compared to the number of training signals, for example p = 
200 and q = 100 000. For this sort of data, dictionary learning finds 
linear subspaces of small dimension where the patches live, leading to 
effective applications in image processing [46 1 . Examples of a dictionary 
for image patches is given in Figure E3J 

In terms of optimization, Eq. ()7.3p is nonconvex and no known al- 
gorithm has a guarantee of providing a global optimum in general, 
whatever the choice of penalty £1 is. A typical approach to find a lo- 
cal minimum is to use a block-coordinate scheme, which optimizes X 
and W in turn, while keeping the other one fixed |47| . Other alterna- 




(7.3) 
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Fig. 7.3: Left: Example of dictionary with p = 256 elements, learned 
on a database of natural 12 x 12 image patches when Q is the £1 -norm. 
Right: Dictionary with p = 400 elements, learned with a structured 
sparsity-inducing penalty f2 (see [89J). 



tives include the K-SVD algorithm [5] (when f2 is the ^crP ena hy), an d 
online learning techniques |87t 1108] that have proven to be particularly 
efficient when the number of signals q is largeQ Convex relaxations of 
dictionary learning have also been proposed in [141 [26] . 

7.4 Bayesian Methods 

While the focus of this monograph is on frequentist approaches to spar- 
sity, and particularly on approaches that minimize a regularized empir- 
ical risk, there naturally exist several BayesiarH approaches to sparsity. 

As a first remark, regularized optimization can be viewed as solving 
a maximum a posteriori (MAP) estimation problem if the loss i (cf. 
Sec. II. 2p defining / can be interpreted as a log-likelihood and the norm 
as certain log-prior. Typically, the £i-norm can for instance be inter- 



1 Such efficient algorithms are freely available in the open-source software package SPAMS 
http : //www . di . ens . f r/willow/SPAMS/ 

2 Bayesian methods can of course not be reduced to nonconvex optimization, but given 
that they are often characterized by multimodality and that corresponding variational 
formulations are typically nonconvex, we conveniently discuss them here. 
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preted as the logarithm of a product of independent Laplace priors on 
the loading vectors w (see, e.g., [124] ). However, the Laplace distribu- 
tion is actually not a sparse prior, in the sense that it is a continuous 
distribution whose samples are thus nonzero almost surely. Besides the 
fact that MAP estimation is generally not considered as a Bayesian 
method (the fundamental principle of Bayesian method is to integrate 
over the uncertainty and avoid point estimates), evidence in the liter- 
ature suggests that MAP is not a good principle to yield estimators 
that are adapted to the corresponding prior. In particular, the Lasso 
does in fact not provide a good algorithm to estimate vectors whose 
coefficients follow a Laplace distribution [57j ! 

To obtain exact zeros in a Bayesian setting, one must use so called 
"spike and slab" priors [63] . Inference with such priors leads to noncon- 
vex optimization problems, and sampling methods, while also simple 
to implement, do not come with any guarantees, in particular in high- 
dimensional settings. 

In reality, while obtaining exact zeros can be valuable from a compu- 
tational point of view, it is a priori not necessary to obtain theoretical 
guarantees associated with sparse methods. In fact, sparsity should be 
understood as the requirement or expectation that a few coefficients 
are significantly larger than most of the rest, an idea which is some- 
how formalized as compressibility in the compressed sensing literature, 
and which inspires automatic relevance determination methods (ARD) 
and the use of heavy-tail priors among Bayesian approaches [98, 148J. 
Using heavy-tailed prior distribution on Wj allows to obtain posterior 
estimates with many small values and a few large values, this effect 
being stronger when the tails are heavier, in particular with Student's 
t-distribution. Heavy-tailed distributions and ARD are very related 
since these prior distributions can be expressed as scaled mixture of 
Gaussians (SJET]. This is of interest computationally, in particular for 
variational methods. 

Variable selection is also obtained in a Bayesian setting by optimiz- 
ing the marginal likelihood of the data over the hyper-parameters, that 
is using empirical Bayes estimation; in that context iterative methods 
based on DC programming may be efficiently used [147J. 

It should be noted that the heavy-tail prior formulation points to an 
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interesting connection between sparsity and the notion of robustness in 
statistics, in which a sparse subset of the data is allowed to take large 
values. This is is also suggested by work such as |149|, 1154] . 



8 

Quantitative Evaluation 



To illustrate and compare the methods presented in this paper, we 
consider in this section three benchmarks. These benchmarks are cho- 
sen to be representative of problems regularized with sparsity-inducing 
norms, involving different norms and different loss functions. To make 
comparisons that are as fair as possible, each algorithm is implemented 
in C/C++, using efficient BLAS and LAPACK libraries for basic lin- 
ear algebra operations. Most of these implementations have been made 
available in the open-source software SPAMlfl All subsequent simula- 
tions are run on a single core of a 3.07Ghz CPU, with 8GB of memory. 
In addition, we take into account several criteria which strongly influ- 
ence the convergence speed of the algorithms. In particular, we consider 

(a) different problem scales, 

(b) different levels of correlations between input variables, 

(c) different strengths of regularization. 

We also show the influence of the required precision by monitoring the 
time of computation as a function of the objective function. 



1 http : //www . di . ens . f r/willow/SPAMS/ 
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For the convenience of the reader, we list here the algorithms com- 
pared and the acronyms we use to refer to them throughout this 
section: the homotopy/LARS algorithm (LARS), coordinate-descent 
(CD), reweighted-^2 schemes (Re-^2)) simple proximal method (ISTA) 
and its accelerated version (FISTA). Note that all methods except the 
working set methods are very simple to implement as each iteration 
is straightforward (for proximal methods such as FISTA or ISTA, as 
long as the proximal operator may be computed efficiently). On the 
contrary, as detailed in Section 16.21 homotopy methods require some 
care in order to achieve the performance we report in this section. 

We also include in the comparisons generic algorithms such as a sub- 
gradient descent algorithm (SG), and a commercial softwar^] for cone 
(CP), quadratic (QP) and second-order cone programming (SOCP) 
problems. 

8.1 Speed Benchmarks for Lasso 

We first present a large benchmark evaluating the performance of var- 
ious optimization methods for solving the Lasso. 

We perform small-scale (n = 200, p = 200) and medium-scale 
(n = 2 000,p = 10 000) experiments. We generate design matrices as 
follows. For the scenario with low correlations, all entries of X are in- 
dependently drawn from a Gaussian distribution N(0, 1/n), which is a 
setting often used to evaluate optimization algorithms in the literature. 
For the scenario with large correlations, we draw the rows of the ma- 
trix X from a multivariate Gaussian distribution for which the average 
absolute value of the correlation between two different columns is eight 
times the one of the scenario with low correlations. Test data vectors 
y = Xw + n where w are randomly generated, with two levels of spar- 
sity to be used with the two different levels of regularization; n is a 
noise vector whose entries are i.i.d. samples from a Gaussian distribu- 
tion N(0, 0.01|| JCtwHj/n). In the low regularization setting the sparsity 
of the vectors w is s = 0.5min(n,p), and in the high regularization 
one s = 0.01 mm(n,p), corresponding to fairly sparse vectors. For SG, 



2 Mosek, available at http://www.mosek.com/. 
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we take the step size to be equal to a/(k + b), where k is the iteration 
number, and (a, b) are the besJI parameters selected on a logarithmic 
grid (a, b) £ {10 3 , . . . , 10} x {10 2 , 10 3 , 10 4 }; we proceeded this way not 
to disadvantage SG by an arbitrary choice of stepsize. 

To sum up, we make a comparison for 8 different conditions (2 scales 
x 2 levels of correlation x 2 levels of regularization) . All results are 
reported on Figures I8.H 18,21 by averaging 5 runs for each experiment. 
Interestingly, we observe that the relative performance of the different 
methods change significantly with the scenario. 

Our conclusions for the different methods are as follows: 

• L ARS/homotopy methods: For the small-scale problem, 
LARS outperforms all other methods for almost every sce- 
nario and precision regime. It is therefore definitely the right 
choice for the small-scale setting. Unlike first-order meth- 
ods, its performance does not depend on the correlation of 
the design matrix X, but rather on the sparsity s of the so- 
lution. In our larger scale setting, it has been competitive 
either when the solution is very sparse (high regularization), 
or when there is high correlation in X (in that case, other 
methods do not perform as well). More importantly, the ho- 
motopy algorithm gives an exact solution and computes the 
regularization path. 

• Proximal methods (ISTA, FISTA): FISTA outperforms 
ISTA in all scenarios but one. Both methods are close for 
high regularization or low correlation, but FISTA is signifi- 
cantly better for high correlation or/and low regularization. 
These methods are almost always outperformed by LARS 
in the small-scale setting, except for low precision and low 
correlation. 

Both methods suffer from correlated features, which is con- 
sistent with the fact that their convergence rate depends on 
the correlation between input variables (convergence as a ge- 
ometric sequence when the correlation matrix is invertible, 



3 "The best step size" is understood here as being the step size leading to the smallest 
objective function after 500 iterations. 



8.1. Speed Benchmarks for Lasso 85 




-3 -2 -1 o 
log(CPU time) in seconds 

(a) corr: low, reg: low 







SG 






Fista 






-0- Ista 




! \* 


Re-L2 






-l-CD 






Lars 






CP 


'.9 i 




WQP 










I 


■ 







-3 -2 -1 
log(CPU time) in seconds 

(b) corr: low, reg: high 




-3 -2 -1 
log{CPU time) in seconds 

(c) corr: high, reg: low 




-3 -2 -1 
log(CPU time) in seconds 

(d) corr: high, reg: high 



Fig. 8.1: Benchmark for solving the Lasso for the small-scale experiment 
(n = 200, p = 200), for the two levels of correlation and two levels of 
regularization, and 8 optimization methods (see main text for details). 
The curves represent the relative value of the objective function as a 
function of the computational time in second on a log 10 / logio scale. 



and as the inverse of a degree- two polynomial otherwise). 
They are well adapted to large-scale settings, with low or 
medium correlation. 
• Coordinate descent (CD): The theoretical analysis of 
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Fig. 8.2: Benchmark for solving the Lasso for the medium-scale exper- 
iment n = 2 000, p = 10 000, for the two levels of correlation and two 
levels of regularization, and 8 optimization methods (see main text for 
details). The curves represent the relative value of the objective func- 
tion as a function of the computational time in second on a log 10 / log 10 
scale. 



these methods suggest that they behave in a similar way to 
proximal methods [1041 1125] . However, empirically, we have 
observed that the behavior of CD often translates into a first 
"warm-up" stage followed by a fast convergence phase. 
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Its performance in the small-scale setting is competitive (even 
though always behind LARS), but less efficient in the large- 
scale one. For a reason we cannot explain, it suffers less than 
proximal methods from correlated features. 

• Reweighted-^2 : This method was outperformed in all our 
experiments by other dedicated methods @ Note that we con- 
sidered only the smoothed alternating scheme of Section 
and not first order methods in rj such as that of A 
more exhaustive comparison should include these as well. 

• Generic Methods (SG, QP, CP): As expected, generic 
methods are not adapted for solving the Lasso and are always 
outperformed by dedicated ones such as LARS. 

Among the methods that we have presented, some require an ini- 
tial overhead computation of the Gram matrix X T X: this is the case 
for coordinate descent and reweighted-^2 methods. We took into ac- 
count this overhead time in all figures, which explains the behavior of 
the corresponding convergence curves. Like homotopy methods, these 
methods could also benefit from an offline pre-computation of X T X 
and would therefore be more competitive if the solutions corresponding 
to several values of the regularization parameter have to be computed. 

We have considered in the above experiments the case of the square 
loss. Obviously, some of the conclusions drawn above would not be valid 
for other smooth losses. On the one hand, the LARS does no longer 
apply; on the other hand, proximal methods are clearly still available 
and coordinate descent schemes, which were dominated by the LARS in 
our experiments, would most likely turn out to be very good contenders 
in that setting. 

8.2 Group-Sparsity for Multi-task Learning 

For ^-regularized least-squares regression, homotopy methods have ap- 
peared in the previous section as one of the best techniques, in almost 

4 Note that the reweighted-^2 scheme requires solving iteratively large-scale linear system 
that are badly conditioned. Our implementation uses LAPACK Cholesky decompositions, 
but a better performance might be obtained using a pre-conditioned conjugate gradient, 
especially in the very large scale setting. 
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all the experimental conditions. 

This second speed benchmark explores a setting where this ho- 
motopy approach cannot be applied anymore. In particular, we con- 
sider a multi-class classification problem in the context of cancer di- 
agnosis. We address this problem from a multi-task viewpoint [107J. 
To this end, we take the regularizer to be l\jti- and ^i/^oo-norms, 
with (non-overlapping) groups of variables penalizing features across 
all classes [831 I1U7| . As a data- fitting term, we now choose a simple 
"1-vs-all" logistic loss function. 

We focus on two multi-class classification problems in the "small 
n, large p" setting, based on two datasets^] of gene expressions. The 
medium-scale dataset contains n = 83 observations, p = 2 308 variables 
and 4 classes, while the large-scale one contains n = 308 samples, p = 
15 009 variables and 26 classes. Both datasets exhibit highly-correlated 
features. 

In addition to ISTA, FISTA, and SG, we consider here the block 
coordinate-descent (BCD) from |141j presented in Section [H We also 
consider a working-set strategy on top of BCD, that optimizes over the 
full set of features (including the non-active ones) only once every four 
iterations. As further discussed in Section HI it is worth mentioning 
that the multi-task setting is well suited for the method of [141 j since 
an appropriate approximation of the Hessian can be easily computed. 

All the results are reported in Figures 18.31 and 18.41 As expected in 
the light of the benchmark for the Lasso, BCD appears as the best 
option, regardless of the sparsity /scale conditions. 

8.3 Structured Sparsity 

In this second series of experiments, the optimization techniques of the 
previous sections are further evaluated when applied to other types of 
loss and sparsity- inducing functions. Instead of the £i-norm previously 
studied, we focus on the particular hierarchical £i/£2-norm 0, intro- 
duced in Section 

From an optimization standpoint, although Q shares some similar- 



5 The two datasets we use are SRBCT and 14-Tumors, which are freely available at 
http : //www. gems-system. org/. 
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Fig. 8.3: Medium- and large-scale multi-class classification problems 
with an ^1/^2-regularization, for three optimization methods (see de- 
tails about the datasets and the methods in the main text). Three 
levels of regularization are considered. The curves represent the rela- 
tive value of the objective function as a function of the computation 
time in second on a log 10 / log 10 scale. In the highly regularized setting, 
the tuning of the step-size for the subgradient turned out to be difficult, 
which explains the behavior of SG in the first iterations. 



ities with the £i-norm (e.g., the convexity and the non-smoothness), it 
differs in that it cannot be decomposed into independent parts (because 
of the overlapping structure of 9)- Coordinate descent schemes hinge 
on this property and as a result, cannot be straightforwardly applied 
in this case. 
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Fig. 8.4: Medium- and large-scale multi-class classification problems 
with an ^i/^oo-regularization for three optimization methods (see de- 
tails about the datasets and the methods in the main text). Three 
levels of regularization are considered. The curves represent the rela- 
tive value of the objective function as a function of the computation 
time in second on a log 10 / log 10 scale. In the highly regularized setting, 
the tuning of the step-size for the subgradient turned out to be difficult, 
which explains the behavior of SG in the first iterations. 



8.3.1 Denoising of Natural Image Patches 

In this first benchmark, we consider a least-squares regression problem 
regularized by f2 that arises in the context of the denoising of natural 
image patches [69]. In particular, based on a hierarchical set of features 
that accounts for different types of edge orientations and frequencies in 
natural images, we seek to reconstruct noisy 16 x 16-patches. Although 
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Fig. 8.5: Benchmark for solving a least-squares regression problem reg- 
ularized by the hierarchical norm f2. The experiment is small scale, 
n = 256, p = 151, and shows the performances of five optimization 
methods (see main text for details) for three levels of regularization. 
The curves represent the relative value of the objective function as a 
function of the computational time in second on a log 10 / log 10 scale. 



the problem involves a small number of variables (namely p = 151), it 
has to be solved repeatedly for thousands of patches, at moderate pre- 
cision. It is therefore crucial to be able to solve this problem efficiently. 

The algorithms that take part in the comparisons are ISTA, FISTA, 
Re-^2, SG, and SOCP. All results are reported in Figure 1831 by aver- 
aging 5 runs. 

We can draw several conclusions from the simulations. First, we ob- 
serve that across all levels of sparsity, the accelerated proximal scheme 
performs better, or similarly, than the other approaches. In addition, as 
opposed to FISTA, ISTA seems to suffer in non-sparse scenarios. In the 
least sparse setting, the reweighted-^2 scheme matches the performance 
of FISTA. However this scheme does not yield truly sparse solutions, 
and would therefore require a subsequent thresholding operation, which 
can be difficult to motivate in a principled way. As expected, the generic 
techniques such as SG and SOCP do not compete with the dedicated 
algorithms. 
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8.3.2 Multi-class Classification of Cancer Diagnosis 

This benchmark focuses on multi-class classification of cancer diagnosis 
and reuses the two datasets from the multi-task problem of Section [H2J 
Inspired by [73] , we build a tree-structured set of groups of features S 
by applying Ward's hierarchical clustering |72] on the gene expressions. 
The norm f2 built that way aims at capturing the hierarchical structure 
of gene expression networks [73]. For more details about this construc- 
tion, see [68j in the context of neuroimaging. The resulting datasets 
with tree-structured sets of features contain p = 4615 and p = 30 017 
variables, for respectively the medium- and large-scale datasets. 

Instead of the square loss function, we consider the multinomial lo- 
gistic loss function, which is better suited for multi-class classification 
problems. As a direct consequence, the algorithms whose applicabil- 
ity crucially depends on the choice of the loss function are removed 
from the benchmark. This is for instance the case for reweighted-^2 
schemes that have closed-form updates available only with the square 
loss (see Section [5]). Importantly, the choice of the multinomial logistic 
loss function requires one to optimize over a matrix with dimensions 
p times the number of classes (i.e., a total of 4 615 x 4 ~ 18 000 and 
30017 x 26 ~ 780 000 variables). Also, for lack of scalability, generic 
interior point solvers could not be considered here. To summarize, the 
following comparisons involve ISTA, FISTA, and SG. 

All the results are reported in Figure IHUl The benchmark especially 
points out that the accelerated proximal scheme performs overall better 
that the two other methods. Again, it is important to note that both 
proximal algorithms yield sparse solutions, which is not the case for SG. 
More generally, this experiment illustrates the flexibility of proximal 
algorithms with respect to the choice of the loss function. 

8.3.3 General Overlapping Groups of Variables 

We consider a structured sparse decomposition problem with overlap- 
ping groups of foo-norms, and compare the proximal gradient algorithm 
FISTA [17] consider the proximal operator presented in Section [3T3] (re- 
ferred to as ProxFlow |88|). Since, the norm we use is a sum of sev- 
eral simple terms, we can bring to bear other optimization techniques 
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Fig. 8.6: Medium- and large-scale multi-class classification problems 
for three optimization methods (see details about the datasets and the 
methods in the main text). Three levels of regularization are considered. 
The curves represent the relative value of the objective function as a 
function of the computation time in second on a log 10 / log 10 scale. 
In the highly regularized setting, the tuning of the step-size for the 
subgradient turned out to be difficult, which explains the behavior of 
SG in the first iterations. 



which are dedicated to this situation, namely proximal splitting method 
known as alternating direction method of multipliers (ADMM) (see, 
e.g., pi [38]). We consider two variants, (ADMM) and (Lin- ADMM)— 
see more details in |89| . 

We consider a design matrix X in R nxp built from overcomplete 
dictionaries of discrete cosine transforms (DCT), which are naturally 
organized on one- or two-dimensional grids and display local correla- 
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tions. The following families of groups 9 using this spatial information 
are thus considered: (1) every contiguous sequence of length 3 for the 
one-dimensional case, and (2) every 3x3-square in the two-dimensional 
setting. We generate vectors y in R n according to the linear model 
y = Xwq + e, where e ~ N(0, 0.01||_X"iuo |||)- The vector wq has 
about 20% nonzero components, randomly selected, while respecting 
the structure of 9 , and uniformly generated in [—1,1]. 

In our experiments, the regularization parameter A is chosen to 
achieve the same level of sparsity (20%). For SG, ADMM and Lin- 
ADMM, some parameters are optimized to provide the lowest value 
of the objective function after 1 000 iterations of the respective al- 
gorithms. For SG, we take the step size to be equal to a/(k + 6), 
where k is the iteration number, and (a, b) are the pair of parame- 
ters selected in {10~ 3 , . . . , 10} x {10 2 , 10 3 , 10 4 }. The parameter 7 for 
ADMM is selected in {10~ 2 , . . . , 10 2 }. The parameters (7,(5) for Lin- 
ADMM are selected in {10~ 2 , . . . , 10 2 } x {10~\ . . . , 10 8 }. For interior 
point methods, since problem (jl.ip can be cast either as a quadratic 
(QP) or as a conic program (CP), we show in Figure 18.71 the re- 
sults for both formulations. On three problems of different sizes, with 
(n,p) G {(100, 10 3 ), (1024, 10 4 ), (1024, 10 5 )}, our algorithms ProxFlow, 
ADMM and Lin- ADMM compare favorably with the other methods, 
(see Figure \8.7\i . except for ADMM in the large-scale setting which 
yields an objective function value similar to that of SG after 10 4 sec- 
onds. Among ProxFlow, ADMM and Lin-ADMM, ProxFlow is con- 
sistently better than Lin-ADMM, which is itself better than ADMM. 
Note that for the small scale problem, the performance of ProxFlow and 
Lin-ADMM is similar. In addition, note that QP, CP, SG, ADMM and 
Lin-ADMM do not obtain sparse solutions, whereas ProxFlow does. 

8.4 General Comments 

We conclude this section by a couple of general remarks on the exper- 
iments that we presented. First, the use of proximal methods is often 
advocated because of their optimal worst case complexities in O(p-) 
(where t is the number of iterations). In practice, in our experiments, 
these and several other methods exhibit empirically convergence rates 
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Fig. 8.7: Speed comparisons: distance to the optimal primal value versus 
CPU time (log-log scale). Due to the computational burden, QP and 
CP could not be run on every problem. 



that are at least linear, if not better, which suggests that the adaptiv- 
ity of the method (e.g., its ability to take advantage of local curvature) 
might be more crucial to its practical success. Second, our experiments 
concentrated on regimes that are of interest for sparse methods in ma- 
chine learning where typically p is larger than n and where it is possi- 
ble to find good sparse solutions. The setting where n is much larger 
than p was out of scope here, but would be worth a separate study, 
and should involve methods from stochastic optimization. Also, even 
though it might make sense from an optimization viewpoint, we did not 
consider problems with low levels of sparsity, that is with more dense 
solution vectors, since it would be a more difficult regime for many 
of the algorithms that we presented (namely LARS, CD or proximal 
methods). 
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Extensions 



We obviously could not cover exhaustively the literature on algorithms 
for sparse methods in this chapter. 

Surveys and comparisons of algorithms for sparse methods have 
been proposed in |119| and |155j . These papers present quite a few al- 
gorithms, but focus essentially on £i-regularization and unfortunately 
do not consider proximal methods. Also, it is not clear that the met- 
rics used to compare the performance of various algorithms is the most 
relevant to machine learning; in particular, we present the full conver- 
gence curves that we believe are more informative than the ordering of 
algorithms at fixed precision. 

Beyond the material presented here, there a few topics that we did 
not develop and that are worth mentioning. 

In the section on proximal methods, we presented the proximal 
methods called forward-backward splitting methods. We applied them 
to objectives which are the sum of two terms: a differentiable func- 
tion with Lipschitz-continuous gradients and a norm. More generally 
these methods apply to the sum of two semi-lower continuous (l.s.c), 
proper, convex functions with non-empty domain, and where one ele- 
ment is assumed differentiable with Lipschitz-continuous gradient [38J . 
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The proximal operator itself dates back to [95] and proximal methods 
themselves date back to [82j [92] . As of today, they have been extended 
to various settings [Ml USB EH1 H39| . In particular, instances of prox- 
imal methods are still applicable if the smoothness assumptions that 
we made on the loss are relaxed. For example, the Douglas-Rachford 
splitting algorithm applies as soon as the objective function to min- 
imize is only assumed l.s.c. proper convex, without any smoothness 
properties (although a l.s.c. convex function is continuous inside of its 
domain) . The augmented Lagrangian techniques (see [211 [381 ES] and 
numerous references therein) and more precisely their variants known 
as alternating-direction methods of multipliers are related to proximal 
methods via duality. These methods are in particular applicable to 
cases where several regularizations and constraints are mixed [89, 136J. 

For certain combination of losses and regularizations, dedicated 
methods have been proposed. This is the case for linear regression with 
the least absolute deviation (LAD) loss (also called £i-loss) with an £\- 
norm regularizer, which leads to a linear program [152J. This is also the 
case for algorithms designed for classical multiple kernel learning when 
the regularizer is the squared norm |111[ 1129} 1132] ; these methods are 
therefore not exactly comparable to the MKL algorithms presented in 
this monograph which apply to objective regularized by the unsquared 
norm (except for reweighted ^-schemes, based on variational formula- 
tions for the squared norm). 

In the context of proximal methods, the metric used to define the 
proximal operator can be modified by judicious rescaling operations, 
in order to fit better the geometry of the data |44| . Moreover, they 
can be mixed with Newton and quasi-Newton methods, for further 
acceleration (see, e.g., |120j ). 

Finally, from a broader outlook, our — a priori deterministic — 
optimization problem (jl.ip may also be tackled with stochastic op- 
timization approaches, which has been the focus of much recent re- 
search [211 E31 El ED fT26lfT53]. 
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Conclusions 



We have tried to provide in this monograph a unified view of spar- 
sity and structured sparsity as it can emerge when convex analysis and 
convex optimization are used as the conceptual basis to formalize re- 
spectively problems and algorithms. In that regards, we did not aim at 
exhaustivity and other paradigms are likely to provide complementary 
views. 

With convexity as a requirement however, using non-smooth norms 
as regularizers is arguably the most natural way to encode sparsity 
constraints. A main difficulty associated with these norms is that they 
are intrinsically non-differentiable; they are however fortunately also 
structured, so that a few concepts can be leveraged to manipulate and 
solve problems regularized with them. To summarize: 

— Fenchel-Legendre duality and the dual norm allow to compute 
subgradients, duality gaps and are also key to exploit sparsity algorith- 
mically via working set methods. More trivially, duality also provides 
an alternative formulation to the initial problem which is sometimes 
more tractable. 

— The proximal operator, which, when it can be computed effi- 
ciently (exactly or approximately) , allows one to treat the optimization 
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problem as if it were a smooth problem. 

— Quadratic variational formulations, which provide an alternative 
way to decouple the difficulties associated with the loss and the non- 
differentiability of the norm. 

Leveraging these different tools lead us to present and compare 
four families of algorithms for sparse methods: proximal methods, 
block-coordinate descent algorithms, reweighted-^2 schemes and the 
LARS /homotopy algorithm that are representative of the state of the 
art. The properties of these methods can be summarized as follows: 

— Proximal methods provide efficient and scalable algorithms that 
are applicable to a wide family of loss functions, that are simple to 
implement, compatible with many sparsity- inducing norms and often 
competitive with the other methods considered. 

— For the square loss, the homotopy method remains the fastest 
algorithm for (a) small and medium scale problems, since its complex- 
ity depends essentially on the size of the active sets, (b) cases with 
very correlated designs. It computes the whole path up to a certain 
sparsity level. Its main drawback is that it is difficult to implement ef- 
ficiently, and it is subject to numerical instabilities. On the other hand, 
coordinate descent and proximal algorithms are trivial to implement. 

— For smooth losses, block-coordinate descent provides one of the 
fastest algorithms but it is limited to separable regularizers. 

— For the square-loss and possibly sophisticated sparsity inducing 
regularizers, reweighted-^2 schemes provide generic algorithms, that are 
still pretty competitive compared to subgradient and interior point 
methods. For general losses, these methods currently require to solve 
iteratively ^-regularized problems and it would be desirable to relax 
this constraint. 

Of course, many learning problems are by essence non-convex and 
several approaches to inducing (sometimes more aggressively) sparsity 
are also non-convex. Beyond providing an overview of these methods to 
the reader as a complement to the convex formulations, we have tried 
to suggest that faced with non-convex non-differentiable and therefore 
potentially quite hard problems to solve, a good strategy is to try and 
reduce the problem to solving iteratively convex problems, since more 
stable algorithms are available and progress can be monitored with 
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duality gaps. 

Last but not least, duality suggests strongly that multiple kernel 
learning is in a sense the dual view to sparsity, and provides a natural 
way, via the "kernel trick", to extend sparsity to reproducing kernel 
Hilbert spaces. We have therefore illustrated throughout the text that 
rather than being a vague connection, this duality can be exploited 
both conceptually, leading to the idea of structured MKL, and algo- 
rithmically to kernelize all of the algorithms we considered so as to 
apply them in the MKL and RKHS settings. 
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